Broken routing on remote vanilla kubernetes #214

Closed
sbor23 opened this issue Oct 17, 2022 · 5 comments
Labels
triage To be investigated

Comments

sbor23 commented Oct 17, 2022

What happened?

After running Gefyra with the following commands, routing from the local Docker container to the cluster does not work.

gefyra up --endpoint 10.33.129.51:31820
gefyra run -i artifactory.xxx.xxx/docker-virtual/dataserver-web -N dataserver-web -n dataserver --env-from deployment/lab-dataserver-web
gefyra bridge -N dataserver-web -C dataserver-web -n dataserver -p 8000:8000 -I dataserver-bridge --deployment lab-dataserver-web

The Django container cannot finish its startup because Postgres is not reachable. Internet-facing tasks such as running an apt update don't work either, so routing in general is broken.

Waiting for PostgreSQL to become available...
  This is taking longer than expected. The following exception may be indicative of an unrecoverable error: 'could not connect to server: Connection timed out
        Is the server running on host "lab-dataserver-web-postgresql" (10.233.55.18) and accepting
        TCP/IP connections on port 5432?
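
As a side note, a plain TCP probe from inside the local container can rule out an application-level problem. The host and port are taken from the log above; this assumes netcat is available in the image:

root@125466b336fa:/# nc -vz -w 5 lab-dataserver-web-postgresql 5432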

When investigating the Cargo container, we found the following routing table:

root@125466b336fa:/# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         172.17.0.1      0.0.0.0         UG    0      0        0 eth0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 eth0
172.29.0.0      0.0.0.0         255.255.0.0     U     0      0        0 eth1

The missing route via wg0 is suspicious. We tried manually adding a route to the cluster network 10.233.0.0/16, but even that did not resolve the issue.

root@125466b336fa:/# ip route add 10.233.0.0/16 dev wg0
root@125466b336fa:/# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         172.17.0.1      0.0.0.0         UG    0      0        0 eth0
10.233.0.0      0.0.0.0         255.255.0.0     U     0      0        0 wg0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 eth0
172.29.0.0      0.0.0.0         255.255.0.0     U     0      0        0 eth1
root@125466b336fa:/# ping 10.233.55.18
PING 10.233.55.18 (10.233.55.18) 56(84) bytes of data.
^C
--- 10.233.55.18 ping statistics ---
36 packets transmitted, 0 received, 100% packet loss, time 35834ms

It seems like the wg0 config on k8s is broken as well.
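
As an additional check, wg show reports the latest handshake and the transfer counters for the tunnel, which helps tell a broken tunnel apart from a broken route. This assumes the wg tool is available inside the Cargo container:

root@125466b336fa:/# wg show wg0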

What did you expect to happen?

Cluster/namespace internal services to be reachable after gefyra run and gefyra bridge, as well as public internet services such as Ubuntu package mirrors.

How can we reproduce it (as minimally and precisely as possible)?

Not sure if this is specific to our k8s setup, which is a self-hosted private cluster. Notably, the networking stack uses Calico.

What Kubernetes setup are you working with?

$ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.0", GitCommit:"a866cbe2e5bbaa01cfd5e969aa3e033f3282a8a2", GitTreeState:"clean", BuildDate:"2022-08-23T17:44:59Z", GoVersion:"go1.19", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.7", GitCommit:"42c05a547468804b2053ecf60a3bd15560362fc2", GitTreeState:"clean", BuildDate:"2022-05-24T12:24:41Z", GoVersion:"go1.17.10", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.25) and server (1.23) exceeds the supported minor version skew of +/-1

OS version

$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
$ uname -a
Linux l01sflalnxrds02 5.15.0-40-generic #43-Ubuntu SMP Wed Jun 15 12:54:21 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Anything else we need to know?

No response

sbor23 added the bug 🐛 Something isn't working label on Oct 17, 2022
SteinRobert (Contributor) commented Oct 17, 2022

Hey @sbor23, thanks for the issue! I believe we have seen something similar here:
#126
There, the package manager was not able to download anything from the internet either, and other strange network behaviour occurred. Setting the MTU correctly when running gefyra up did the trick.
As for the correct MTU value: that seems to depend on your network setup. Could you please experiment with the --wireguard-mtu flag in gefyra up?

https://gefyra.dev/reference/cli/#up
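
For illustration, a run with an explicit MTU could look like the following; the value 1340 is only a placeholder and would need to be tuned for the WireGuard/Calico overhead in your network, and the gefyra down beforehand removes the previous setup:

gefyra down
gefyra up --endpoint 10.33.129.51:31820 --wireguard-mtu 1340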

@Schille you have any other idea or input on this one?

sbor23 (Author) commented Oct 18, 2022

Thanks for the hint @SteinRobert. I played around a little bit with the MTU, but it didn't change anything.

Also, I found that I can ping the WireGuard server using the tunnel IP, so ping 192.168.99.1 worked. We dug a bit more using tcpdump and found that client-side (Cargo) routing seems to be working correctly.
We can see packets going in on the Cargo side and packets coming out on the Stowaway/k8s side. However, the packets don't reach the target ports and thus there is no reply.
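
For reference, captures along these lines are enough to see that; the interface, address and port come from this thread, while the cluster-side command is only a sketch - the namespace is assumed to be gefyra, the pod name is a placeholder and tcpdump may not be present in the Stowaway image:

# local (Cargo) side
root@125466b336fa:/# tcpdump -ni wg0 host 10.233.55.18 and port 5432
# cluster (Stowaway) side
kubectl -n gefyra exec -it <stowaway-pod> -- tcpdump -ni any port 5432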

So at the moment this looks like a problem with our k8s networking stack, maybe related to Calico. We will investigate and post here if we find anything else, for future reference.

SteinRobert (Contributor) commented:

Awesome! Thanks for the feedback so far! We're looking forward to hearing more about it!

SteinRobert added the triage To be investigated label and removed the bug 🐛 Something isn't working label on Oct 19, 2022
SteinRobert (Contributor) commented Nov 2, 2022

@sbor23 we experienced similar behaviour in one of our environments. In our case we were able to solve it, and the fix was released in 0.13.1.
It was handled in #236
Could you please update and try again with your setup? Please make sure you run gefyra down first - the change is in the Stowaway, which is removed by that command.
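
For completeness, the update cycle could look roughly like this, assuming the client was installed via pip; the endpoint is the one from the original report:

gefyra down
pip install --upgrade gefyra
gefyra up --endpoint 10.33.129.51:31820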

sbor23 (Author) commented Nov 3, 2022

Can confirm the problem is solved in 0.13.1.
Thanks a lot!
