Broken routing on remote vanilla kubernetes #214

Closed
sbor23 opened this issue Oct 17, 2022 · 5 comments
Labels
triage To be investigated

Comments

sbor23 commented Oct 17, 2022

What happened?

After running Gefyra with the following commands, routing from the local Docker container to the cluster does not work.

gefyra up --endpoint 10.33.129.51:31820
gefyra run -i artifactory.xxx.xxx/docker-virtual/dataserver-web -N dataserver-web -n dataserver --env-from deployment/lab-dataserver-web
gefyra bridge -N dataserver-web -C dataserver-web -n dataserver -p 8000:8000 -I dataserver-bridge --deployment lab-dataserver-web

The Django container cannot finish its startup because Postgres is not reachable. Internet-facing tasks such as running an apt update don't work either, so routing in general is broken.

Waiting for PostgreSQL to become available...
  This is taking longer than expected. The following exception may be indicative of an unrecoverable error: 'could not connect to server: Connection timed out
        Is the server running on host "lab-dataserver-web-postgresql" (10.233.55.18) and accepting
        TCP/IP connections on port 5432?
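
As a side note, a plain TCP probe from inside the local container can rule out an application-level problem. The host and port are taken from the log above; this assumes netcat is available in the image:

root@125466b336fa:/# nc -vz -w 5 lab-dataserver-web-postgresql 5432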

When investigating the Cargo container, we found the following routing table:

root@125466b336fa:/# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         172.17.0.1      0.0.0.0         UG    0      0        0 eth0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 eth0
172.29.0.0      0.0.0.0         255.255.0.0     U     0      0        0 eth1

The missing route via wg0 is suspicious. We tried manually adding a route to the cluster network 10.233.0.0/16, but even that did not resolve the issue.

root@125466b336fa:/# ip route add 10.233.0.0/16 dev wg0
root@125466b336fa:/# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         172.17.0.1      0.0.0.0         UG    0      0        0 eth0
10.233.0.0      0.0.0.0         255.255.0.0     U     0      0        0 wg0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 eth0
172.29.0.0      0.0.0.0         255.255.0.0     U     0      0        0 eth1
root@125466b336fa:/# ping 10.233.55.18
PING 10.233.55.18 (10.233.55.18) 56(84) bytes of data.
^C
--- 10.233.55.18 ping statistics ---
36 packets transmitted, 0 received, 100% packet loss, time 35834ms

It seems like the wg0 config on k8s is broken as well.
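
As an additional check, wg show reports the latest handshake and the transfer counters for the tunnel, which helps tell a broken tunnel apart from a broken route. This assumes the wg tool is available inside the Cargo container:

root@125466b336fa:/# wg show wg0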

What did you expect to happen?

Cluster/namespace internal services to be reachable after gefyra run and gefyra bridge, as well as public internet services such as Ubuntu package mirrors.

How can we reproduce it (as minimally and precisely as possible)?

Not sure if this is specific to our k8s setup, which is a self-hosted private cluster. Notably, the networking stack uses Calico.

What Kubernetes setup are you working with?

$ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.0", GitCommit:"a866cbe2e5bbaa01cfd5e969aa3e033f3282a8a2", GitTreeState:"clean", BuildDate:"2022-08-23T17:44:59Z", GoVersion:"go1.19", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.7", GitCommit:"42c05a547468804b2053ecf60a3bd15560362fc2", GitTreeState:"clean", BuildDate:"2022-05-24T12:24:41Z", GoVersion:"go1.17.10", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.25) and server (1.23) exceeds the supported minor version skew of +/-1

OS version

$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
$ uname -a
Linux l01sflalnxrds02 5.15.0-40-generic #43-Ubuntu SMP Wed Jun 15 12:54:21 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Anything else we need to know?

No response

sbor23 added the bug 🐛 Something isn't working label on Oct 17, 2022
SteinRobert (Contributor) commented Oct 17, 2022

Hey @sbor23, thanks for the issue! I believe we have seen something similar here:
#126
There, the package manager was not able to download anything from the internet either, and other strange network behaviour occurred. Setting the MTU correctly when running gefyra up did the trick.
As for the correct MTU value: that seems to depend on your network setup. Could you please experiment with the --wireguard-mtu flag in gefyra up?

https://gefyra.dev/reference/cli/#up
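
For illustration, a run with an explicit MTU could look like the following; the value 1340 is only a placeholder and would need to be tuned for the WireGuard/Calico overhead in your network, and the gefyra down beforehand removes the previous setup:

gefyra down
gefyra up --endpoint 10.33.129.51:31820 --wireguard-mtu 1340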

@Schille you have any other idea or input on this one?

sbor23 (Author) commented Oct 18, 2022

Thanks for the hint @SteinRobert. I played around a little bit with the MTU, but it didn't change anything.

Also, I found that I can ping the WireGuard server using the tunnel IP, so ping 192.168.99.1 worked. We dug a bit more using tcpdump and found that client-side (Cargo) routing seems to be working correctly.
We can see packets going in on the Cargo side and packets coming out on the Stowaway/k8s side. However, the packets don't reach the target ports and thus there is no reply.
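
For reference, captures along these lines are enough to see that; the interface, address and port come from this thread, while the cluster-side command is only a sketch - the namespace is assumed to be gefyra, the pod name is a placeholder and tcpdump may not be present in the Stowaway image:

# local (Cargo) side
root@125466b336fa:/# tcpdump -ni wg0 host 10.233.55.18 and port 5432
# cluster (Stowaway) side
kubectl -n gefyra exec -it <stowaway-pod> -- tcpdump -ni any port 5432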

So at the moment this looks like a problem with our k8s networking stack, maybe related to Calico. We will investigate and post here if we find anything else, for future reference.

SteinRobert (Contributor) commented:

Awesome! Thanks for the feedback so far! We're looking forward to hearing more about it!

SteinRobert added the triage To be investigated label and removed the bug 🐛 Something isn't working label on Oct 19, 2022
SteinRobert (Contributor) commented Nov 2, 2022

@sbor23 we experienced similar behaviour in one of our environments. In our case we were able to solve it, and the fix was released in 0.13.1.
It was handled in #236
Could you please update and try again with your setup? Please make sure you run gefyra down first - the change is in the Stowaway, which is removed by that command.
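
For completeness, the update cycle could look roughly like this, assuming the client was installed via pip; the endpoint is the one from the original report:

gefyra down
pip install --upgrade gefyra
gefyra up --endpoint 10.33.129.51:31820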

sbor23 (Author) commented Nov 3, 2022

Can confirm the problem is solved in 0.13.1.
Thanks a lot!
