Some antrea-agent pods stuck in termination state for around 15 minutes #625

Closed · tnqn opened this issue Apr 17, 2020 · 13 comments · Fixed by #658

tnqn (Member) commented Apr 17, 2020

Thanks @alex-vmw for reporting and helping troubleshoot this issue. The following quotes @alex-vmw's report, with a few details revised.

Describe the bug
When rolling-updating the antrea-agent DaemonSet, it has happened several times (almost every time on one cluster) that some antrea-agent Pods were stuck in the terminating state for around 15 minutes, after which they recovered automatically. Some observations and analysis below:

  1. The issue continues to disproportionately happen on master Nodes where the apiserver is running.
  2. The antrea-agent Pod is deleted on the Node, but kubelet is NOT able to update/delete the status of the Pod in Kubernetes (in etcd).
  3. kubelet contacts the apiserver every 30 seconds to delete the status of the Pod for 15 minutes 30 seconds, and the apiserver responds with error 504 (which indicates that the apiserver probably wasn't able to connect to etcd) until it finally responds with 200 (success) at the end.
  4. In the cluster, apiservers connect to etcd via its IP on port 2739. A headless ClusterIP Service (without any selectors), with Endpoints pointing to the 5 etcd Nodes, facilitates the connections (a sketch of such a Service/Endpoints follows below).
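
For illustration only, here is a minimal sketch of the kind of selector-less headless Service plus manual Endpoints described in item 4; the name, namespace, and IP addresses are placeholders, not values taken from the report:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: etcd             # hypothetical name
  namespace: kube-system # hypothetical namespace
spec:
  clusterIP: None        # headless, and no selector is defined
  ports:
  - port: 2739
    targetPort: 2739
---
apiVersion: v1
kind: Endpoints
metadata:
  name: etcd             # must match the Service name
  namespace: kube-system
subsets:
- addresses:             # placeholder IPs standing in for the 5 etcd Nodes
  - ip: 10.0.0.11
  - ip: 10.0.0.12
  - ip: 10.0.0.13
  - ip: 10.0.0.14
  - ip: 10.0.0.15
  ports:
  - port: 2739
EOF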

Another issue was also hit, where apiservers became extremely slow to respond for about 15-16 minutes because they were NOT able to connect to the metrics-server Service, getting errors 503 and 504. Here is what we discovered:

  1. metrics-server pod was running on a worker node
  2. antrea-agent on the worker node was restarted around 18:33:35
  3. Error logs from 3 apiservers show inability to connect to the metrics-server via ClusterIP Service (172.21.90.223) for 15-16 minutes:
    master001 - 18:33:35-18:50:12 - 16 min 37 sec
    master002 - 18:34:54-18:50:14 - 15 min 20 sec
    master003 - 18:33:45-18:50:14 - 16 min 29 sec

Note that:

  1. the etcd cluster is outside the Kubernetes cluster here.
  2. the OS is CoreOS.

To Reproduce
Perform a rolling update of the antrea-agent DaemonSet.

Expected
A rolling update of antrea-agent shouldn't take 15 minutes on some Nodes and shouldn't affect the connection between the apiserver and metrics-server.

Actual behavior
As described above.

Versions:
Please provide the following information:

  • Antrea version (Docker image tag).
    0.5.1
  • Kubernetes version (use kubectl version). If your Kubernetes components have different versions, please provide the version for all of them.
    1.15.4
  • Container runtime: which runtime are you using (e.g. containerd, cri-o, docker) and which version are you using?
    docker
  • Linux kernel version on the Kubernetes Nodes (uname -r).
    Unknown yet
  • If you chose to compile the Open vSwitch kernel module manually instead of using the kernel module built into the Linux kernel, which version of the OVS kernel module are you using? Include the output of modinfo openvswitch for the Kubernetes Nodes.

Additional context

tnqn added the bug label Apr 17, 2020
tnqn (Member, Author) commented Apr 22, 2020

It seems the issue was caused by an unexpected mismatch in OVS conntrack state.

I'm able to reproduce the connection issue between metrics-server and kube-apiserver. The steps are:

  1. Deploy metrics-server: kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/download/v0.3.6/components.yaml. It is better to schedule it to a worker Node to exclude other factors.
  2. Check its connection list; there should be a connection to the kubernetes API Service 10.96.0.1:443.
    metrics-server doesn't have netstat installed, so you need to enter its network namespace via nsenter -n -t <PID> (see the sketch after this step):
# netstat -anp
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 10.30.1.84:35294        10.96.0.1:443           ESTABLISHED 29636/metrics-serve
tcp6       0      0 :::4443                 :::*                    LISTEN      29636/metrics-serve
tcp6       0      0 10.30.1.84:4443         10.30.0.1:34634         ESTABLISHED 29636/metrics-serve
tcp6       0      0 10.30.1.84:4443         10.30.0.1:34618         ESTABLISHED 29636/metrics-serve

There should be 1 conntrack record in zone 65520 (in host network):

# conntrack -L -w 65520 | grep 35294
tcp      6 86128 ESTABLISHED src=10.30.1.84 dst=10.96.0.1 sport=35294 dport=443 src=10.96.0.1 dst=10.30.1.84 sport=443 dport=35294 [ASSURED] mark=0 zone=65520 use=1

It is committed by an OpenFlow ct(commit) action.
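
A minimal sketch of one way to find the <PID> mentioned in step 2 and run netstat inside the metrics-server network namespace; it assumes a Docker runtime and a container name containing "metrics-server", so adjust for containerd or cri-o:

# get the PID of the metrics-server container (Docker runtime assumed)
PID=$(docker inspect --format '{{.State.Pid}}' \
      $(docker ps -q --filter name=metrics-server | head -n 1))
# run netstat from the host inside that container's network namespace
nsenter -n -t "$PID" netstat -anp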

  3. Delete the antrea-agent Pod so that the OVS userspace daemons will restart.
  4. Check the metrics-server connection list again; its Send-Q will increase:
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0    513 10.30.1.84:35294        10.96.0.1:443           ESTABLISHED 29636/metrics-serve
tcp6       0      0 :::4443                 :::*                    LISTEN      29636/metrics-serve
tcp6       0      0 10.30.1.84:4443         10.30.0.1:34634         ESTABLISHED 29636/metrics-serve
tcp6       0      0 10.30.1.84:4443         10.30.0.1:34618         ESTABLISHED 29636/metrics-serve

Check the OpenFlow stats; you should find that the packets matched +inv+trk and were dropped, which is not expected:

# ovs-ofctl dump-flows br-int table=31 |grep drop
cookie=0x7f15000000000000, duration=717.871s, table=31, n_packets=27, n_bytes=19909, idle_age=18, priority=200,ct_state=+inv+trk,ip actions=drop

Check conntrack; the record is still there but is not matched:

tcp      6 85593 ESTABLISHED src=10.30.1.84 dst=10.96.0.1 sport=35294 dport=443 src=10.96.0.1 dst=10.30.1.84 sport=443 dport=35294 [ASSURED] mark=0 zone=65520 use=1

The issue may not happen on the first try; if not, repeat the steps several times.

Steps to reproduce without antrea

After some experiments, it seems the issue occurs because some packets are forwarded by the default flow after OVS comes up and before antrea-agent installs its flows. Subsequent packets of those connections are then always marked "+inv+trk", even though the conntrack record is still there. It can be reproduced with the steps below:

  1. Create a bridge, 2 namespaces, 2 veth pairs
ovs-vsctl add-br br-int

ip link add a0 type veth peer name a1
ip netns add a
ip link set a0 netns a
ip netns exec a ifconfig a0 172.16.0.2 netmask 255.255.255.0
ip link set a1 up
ovs-vsctl add-port br-int a1

ip link add b0 type veth peer name b1
ip netns add b
ip link set b0 netns b
ip netns exec b ifconfig b0 172.16.0.3 netmask 255.255.255.0
ip link set b1 up
ovs-vsctl add-port br-int b1
  2. Install the following flows:
ovs-ofctl add-flow br-int "table=0,priority=100, ip, actions=ct(table=1,zone=65520)"
ovs-ofctl add-flow br-int "table=1,priority=100, ct_state=+inv+trk,ip, actions=drop"
ovs-ofctl add-flow br-int "table=1,priority=90, ip, actions=resubmit(,2)"
ovs-ofctl add-flow br-int "table=2,priority=100, ct_state=+new+trk,ip actions=ct(commit,table=3,zone=65520)"
ovs-ofctl add-flow br-int "table=2,priority=90, ip actions=resubmit(,3)"
ovs-ofctl add-flow br-int "table=3,priority=100, ip actions=normal"

We expect connections to be committed to zone 65520 and not dropped.
  3. Start a server in netns a, then connect to it from netns b:

# Execute it in a terminal, we will input something later
ip netns exec a nc -l 80
# Execute it in another terminal, we will input something later
ip netns exec b nc 172.16.0.2 80
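
As an optional verification step (not part of the original instructions), the connection should now have been committed to zone 65520 and be visible in conntrack:

# expect one ESTABLISHED entry for 172.16.0.3 -> 172.16.0.2 dport=80
conntrack -L -w 65520 | grep 172.16.0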
  4. Restart OVS:
/etc/init.d/openvswitch-switch stop
/etc/init.d/openvswitch-switch start
  5. Input something in both the client and server terminals and send it to the other side. At this moment, the packets will be forwarded by the default flow.

  6. Reinstall the flows:

ovs-ofctl add-flow br-int "table=0,priority=100, ip, actions=ct(table=1,zone=65520)"
ovs-ofctl add-flow br-int "table=1,priority=100, ct_state=+inv+trk,ip, actions=drop"
ovs-ofctl add-flow br-int "table=1,priority=90, ip, actions=resubmit(,2)"
ovs-ofctl add-flow br-int "table=2,priority=100, ct_state=+new+trk,ip actions=ct(commit,table=3,zone=65520)"
ovs-ofctl add-flow br-int "table=2,priority=90, ip actions=resubmit(,3)"
ovs-ofctl add-flow br-int "table=3,priority=100, ip actions=normal"
  7. Now the client and server cannot reach each other; the packets are dropped because they match "+inv+trk", even though the conntrack record is still in zone 65520.
    If we skip step 5, the issue does not happen.
    If we only send a message from the client to the server, or only in the opposite direction, the issue does not happen.

How to fix

While the issue could be due to some unexpected behavior in OVS, for which we should ask the OVS community for help, we could improve Antrea to avoid it:
Instead of having a time window in which packets can be forwarded by the default flow, which is not safe and causes the conntrack issue we have observed, we could start OVS with flow-restore-wait set to true, so that no packets are forwarded until antrea-agent has installed the required flows and cleared the flag. This is similar to how the openvswitch scripts implement restart.
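
A minimal sketch of the proposed sequence, using the standard Open_vSwitch other_config knob; exactly how this is wired into the Antrea start scripts is omitted here:

# set the flag before (re)starting ovs-vswitchd so it does not forward anything yet
ovs-vsctl --no-wait set Open_vSwitch . other_config:flow-restore-wait="true"
# ... start ovs-vswitchd and let antrea-agent install the required flows ...
# then clear the flag to re-enable forwarding
ovs-vsctl remove Open_vSwitch . other_config flow-restore-wait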
@jianjuns @antoninbas @salv-orlando what do you think?

BTW, I can reproduce it with OVS userspace 2.13.0 and 2.12.0.

antoninbas (Contributor) commented:

Great finding @tnqn

Then following packets will always be "+inv+trk" even if the conntrack record is still there.

Do you know why? Is this an issue with OVS? Why does it depend on whether some packets are forwarded using the default NORMAL flow after OVS comes back up?

tnqn (Member, Author) commented Apr 23, 2020

@antoninbas I don't know why yet; I'm asking OVS experts whether this is an issue or a misconfiguration. Do you think we could use flow-restore-wait to block forwarding until antrea-agent installs the necessary flows? It may cause packets to be dropped for a few seconds, even for established connections.

antoninbas (Contributor) commented:

@tnqn Do you think we can clear the flag after adding the "connectivity" flows (that should be very fast) and without waiting for NP flows? If yes, then let's make the change.

tnqn (Member, Author) commented Apr 23, 2020

@antoninbas Yes, I think so; currently we don't wait for NP flows to be installed before enabling forwarding anyway.

jianjuns (Contributor) commented:

But what flows are needed to avoid the conntrack state mismatch? Any flow, except the default flow?

tnqn added a commit to tnqn/antrea that referenced this issue Apr 27, 2020
This patch starts ovs-vswitchd with flow-restore-wait set to true and
removes the config after restoring necessary flows for the following
reasons:

1. It prevents packets from being mishandled by ovs-vswitchd in its
default fashion, which could affect existing connections' conntrack
state and cause issues like antrea-io#625.

2. It prevents ovs-vswitchd from flushing or expiring previously set
datapath flows, so existing connections can achieve 0 downtime during
OVS restart. As a result, we remove the config here after restoring
necessary flows.

tnqn (Member, Author) commented Apr 27, 2020

Thanks to Yi-Hung Wei from the OVS community for finding the root cause. Quoting his explanation:

I think the root cause of this issue is the TCP window checking in nf_conntrack. Basically, if TCP window checking is enabled, nf_conntrack checks whether the ACK and SEQ numbers of a packet are within a valid range. If the ACK or SEQ number does not comply with the TCP protocol, nf_conntrack marks the packet as invalid.

This issue happened in your proposed scenario because:
In step 3, OVS commits the TCP connection to nf_conntrack.
In step 5, we inject some traffic between the nc server and client that changes the TCP window. However, with the NORMAL rule, OVS does not send the packets to nf_conntrack to update the corresponding TCP state.
In step 6, we re-insert the flows and send the TCP traffic to nf_conntrack. At this point, nf_conntrack detects inconsistent SEQ/ACK numbers in the TCP stream and marks the packets as invalid.

@jianjuns according to the explanation above, it's not only the default flow: we must send all packets to conntrack to keep the connection's state tracked correctly.
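
As a side note for verifying this diagnosis (not part of the proposed fix), the kernel's strict TCP window tracking can be inspected and relaxed via a sysctl:

# 0 (the default) means strict window checking: out-of-window packets are marked INVALID
sysctl net.netfilter.nf_conntrack_tcp_be_liberal
# setting it to 1 makes nf_conntrack accept out-of-window packets; useful only
# to confirm the diagnosis, not as a fix
sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1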

I think using "flow-restore-wait" can fix it, and we should start forwarding at least after restoring the pipeline, but it's also good to wait until pod's flow are restored since it's very fast.

Besides, I found another good reason for using "flow-restore-wait": previously, the datapath flows were flushed once ovs-vswitchd started, so existing connections, especially cross-Node ones, could still experience some downtime before antrea-agent restored the flows.

With "flow-restore-wait", ovs-vswitchd won't flush or expire previously set datapath flows, so existing connections can achieve real 0 downtime in theory.

tnqn self-assigned this Apr 27, 2020

jianjuns (Contributor) commented:

Sounds like a good approach (though ideally we should apply all flows in a single bundle after restart).

antoninbas (Contributor) commented:

Actually, I was thinking about this more over the weekend. Should it be considered a security issue that we restore connectivity before we re-install NP flows? Maybe we should switch to a single bundle ASAP, as Jianjun suggested.

tnqn (Member, Author) commented Apr 28, 2020

@jianjuns @antoninbas I agree that ideally we should enable ovs-vswitchd forwarding only after installing all flows, including Pod, route, and NP flows, especially since, with flow-restore-wait set, established connections can continue to work and new connections won't be mishandled in the default fashion.
However, in the current code we don't have a clear point at which we know the route and NP flows have been installed. We need a way for NodeRouteController and NetworkPolicyController to signal that they have installed the flows for the initial routes and policies. I think we can do that in another PR, as the issue exists with or without this PR.
As for a single bundle applying all flows, I'm not sure it is easy to achieve, as we have several modules installing their own flows. But is there a difference between installing the initial flows in a single bundle and installing them from several modules separately, if flow-restore-wait is set? One use of this flag is to avoid intermediate states too:

This prevents controllers from making changes to the flow table in the middle of flow restoration, which could result in undesirable intermediate states.
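
For reference, OpenFlow bundles can also be applied atomically from the command line; the sketch below only illustrates the "single bundle" idea and is not how antrea-agent installs flows today:

# apply a prepared set of flows to br-int atomically as a single OpenFlow
# bundle (flows.txt uses the same syntax as ovs-ofctl add-flow)
ovs-ofctl --bundle replace-flows br-int flows.txt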

tnqn (Member, Author) commented Apr 28, 2020

BTW, I am now dealing with an issue in #658: a Kind cluster doesn't work with flow-restore-wait. I think it might be because the Node's network connectivity requires OVS forwarding, while antrea-agent needs network connectivity to fetch Node and Pod information before enabling forwarding. Do you have any idea how I could make Kind work? @antoninbas
If there is no solution, I might have to install the flows that have no network dependency first, but that would be further from the ideal solution we discussed above.

antoninbas (Contributor) commented:

@tnqn the easiest thing to do for now may be to avoid using flow-restore-wait altogether for Kind clusters, i.e. if the datapath type is netdev.
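
For reference, the datapath type of the bridge can be checked as follows (a minimal illustration, assuming the bridge is named br-int):

# expected to print "netdev" for the userspace datapath used by Kind; empty or
# "system" for the kernel datapath
ovs-vsctl get bridge br-int datapath_type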

jianjuns (Contributor) commented:

Agreed we should fix the traffic drop issue first.

However, in current code we don't have a clear boundary that we have installed Route and NP flows. We need a way for NodeRouteController and NetworkPolicyController to notify that they have installed flows for initial routes and policies. I think we can do it in another PR, as the issue is there with or without this PR.

antoninbas added this to the Antrea v0.6.0 release milestone Apr 29, 2020
antoninbas pushed a commit that referenced this issue Apr 29, 2020
McCodeman pushed a commit to McCodeman/antrea that referenced this issue Jun 2, 2020
McCodeman pushed a commit that referenced this issue Jun 2, 2020
GraysonWu pushed a commit to GraysonWu/antrea that referenced this issue Sep 22, 2020