Bugfix: Resolve a deadlock in cluster memberlist maintenance
The issue is that several Antrea Agents ran out of memory in a large-scale cluster,
and we observed that the memory of the failed Antrea Agents kept increasing, from
400MB to 1.8GB in less than 24 hours.

After profiling the Agent memory and call stacks, we found that most of the memory is
taken by Node resources received by the Node informer's watch function. From the
goroutines, we found a deadlock:
1. Function "Cluster.Run()" is stuck calling "Memberlist.Join()", which blocks
   waiting to acquire "Memberlist.nodeLock".
2. Memberlist has received a Node Join/Leave message sent by another Agent, and holds
   the lock "Memberlist.nodeLock". It blocks when sending the message to
   "Cluster.nodeEventsCh", while the consumer is also blocked.

The issue may happen in a large-scale setup. Although Antrea buffers 1024 messages
in "Cluster.nodeEventsCh", a cluster with many Nodes can fill the channel before the
Agent completes sending out the Member join messages for the existing Nodes.

To resolve the issue, this patch removes the unnecessary call to Memberlist.Join()
in Cluster.Run(), since it is also called by the "NodeAdd" event triggered by the
NodeInformer.

Signed-off-by: wenyingd <wenyingd@vmware.com>
wenyingd committed Dec 13, 2022
1 parent c891945 commit 93645a1
Showing 2 changed files with 0 additions and 30 deletions.
27 changes: 0 additions & 27 deletions pkg/agent/memberlist/cluster.go
@@ -275,23 +275,6 @@ func (c *Cluster) newClusterMember(node *corev1.Node) (string, error) {
return nodeAddr.String(), nil
}

func (c *Cluster) allClusterMembers() (clusterNodes []string, err error) {
nodes, err := c.nodeLister.List(labels.Everything())
if err != nil {
return nil, fmt.Errorf("listing Nodes error: %v", err)
}

for _, node := range nodes {
member, err := c.newClusterMember(node)
if err != nil {
klog.ErrorS(err, "Get Node failed")
continue
}
clusterNodes = append(clusterNodes, member)
}
return
}

func (c *Cluster) filterEIPsFromNodeLabels(node *corev1.Node) sets.String {
pools := sets.NewString()
eips, err := c.externalIPPoolLister.List(labels.Everything())
@@ -325,16 +308,6 @@ func (c *Cluster) Run(stopCh <-chan struct{}) {
return
}

members, err := c.allClusterMembers()
if err != nil {
klog.ErrorS(err, "List cluster members failed")
} else if members != nil {
_, err := c.mList.Join(members)
if err != nil {
klog.ErrorS(err, "Join cluster failed")
}
}

for i := 0; i < defaultWorkers; i++ {
go wait.Until(c.worker, time.Second, stopCh)
}
3 changes: 0 additions & 3 deletions pkg/agent/memberlist/cluster_test.go
@@ -156,9 +156,6 @@ func TestCluster_Run(t *testing.T) {
res, err := fakeCluster.cluster.ShouldSelectIP(tCase.egress.Spec.EgressIP, eip.Name)
return err == nil && res == tCase.expectEgressSelectResult, nil
}), "select Node result for Egress does not match")
allMembers, err := fakeCluster.cluster.allClusterMembers()
assert.NoError(t, err)
assert.Len(t, allMembers, 1, "expected Node member num is 1")
assert.Equal(t, 1, fakeCluster.cluster.mList.NumMembers(), "expected alive Node num is 1")
})
}
