
Bugfix: Resolve a deadlock in cluster memberlist maintenance #4469

Merged: 1 commit merged into antrea-io:main on Dec 13, 2022

Conversation

wenyingd
Contributor

The issue is that several Antrea Agents run out of memory in a large-scale cluster; we observed that the memory of a failing Agent increased continuously, from 400MB to 1.8GB in less than 24 hours.

After profiling the Agent memory and call stacks, we found that most of the memory is taken by Node resources received by the Node informer's watch function. From the goroutines, we found a deadlock:

  1. Function "Cluster.Run()" is stuck calling "Memberlist.Join()", which is blocked on acquiring the lock "Memberlist.nodeLock".
  2. Memberlist has received a Node Join/Leave message sent by another Agent and holds the lock "Memberlist.nodeLock". It is blocked sending the message to "Cluster.nodeEventsCh", whose consumer is also blocked.

The issue is more likely to happen in a large-scale setup. Although Antrea buffers 1024 messages in "Cluster.nodeEventsCh", a large number of Nodes in the cluster can fill the channel before the Agent finishes sending out the Member join messages for the existing Nodes.

To resolve the issue, this patch removes the unnecessary call to Memberlist.Join() in Cluster.Run(), since it is also called when handling the "NodeAdd" events triggered by the NodeInformer.

Signed-off-by: wenyingd <wenyingd@vmware.com>
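
To make the failure mode concrete, here is a minimal, self-contained Go sketch of the pattern described above. The types and function names (cluster, notifyJoin, run) are simplified stand-ins for illustration only, not the actual code in pkg/agent/memberlist/cluster.go or hashicorp/memberlist:

```go
// deadlock_sketch.go - illustrative only, not Antrea code.
package main

import (
	"sync"
	"time"
)

type cluster struct {
	nodeLock     sync.Mutex
	nodeEventsCh chan string // buffered (1024 in the real code); tiny here to hang quickly
}

// notifyJoin mimics memberlist delivering a peer's Join/Leave message: it holds
// the node lock while forwarding the event, so it blocks once the buffer is full.
func (c *cluster) notifyJoin(node string) {
	c.nodeLock.Lock()
	defer c.nodeLock.Unlock()
	c.nodeEventsCh <- node // blocks when nodeEventsCh is full
}

// run mimics Cluster.Run(): it joins the existing members (which needs the same
// lock) before it starts draining nodeEventsCh, so the consumer never starts.
func (c *cluster) run() {
	c.nodeLock.Lock() // blocks while notifyJoin holds the lock -> deadlock
	// ... send join messages for all existing Nodes ...
	c.nodeLock.Unlock()

	for range c.nodeEventsCh { // the consumer that would have drained the channel
	}
}

func main() {
	c := &cluster{nodeEventsCh: make(chan string, 1)}
	c.nodeEventsCh <- "node-0" // fill the buffer
	go c.notifyJoin("node-1")  // holds nodeLock, blocks on the full channel
	time.Sleep(100 * time.Millisecond)
	c.run() // blocks on nodeLock; the Go runtime then reports "all goroutines are asleep - deadlock!"
}
```

With the Memberlist.Join() call removed from Cluster.Run(), as this patch does, the event consumer no longer waits behind a join that contends for the node lock, so the notifier's send can always complete (assuming Run starts the nodeEventsCh consumer once the join is out of the way, as the description implies).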

@wenyingd wenyingd requested a review from tnqn December 13, 2022 06:48
@wenyingd wenyingd added the kind/bug Categorizes issue or PR as related to a bug. label Dec 13, 2022
@wenyingd wenyingd added this to the Antrea v1.10 release milestone Dec 13, 2022
@codecov

codecov bot commented Dec 13, 2022

Codecov Report

Merging #4469 (a1690de) into main (b503a46) will decrease coverage by 0.10%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4469      +/-   ##
==========================================
- Coverage   65.89%   65.79%   -0.11%     
==========================================
  Files         402      402              
  Lines       57247    57227      -20     
==========================================
- Hits        37723    37650      -73     
- Misses      16822    16881      +59     
+ Partials     2702     2696       -6     
Flag                Coverage Δ
e2e-tests           38.32% <ø> (?)
integration-tests   34.59% <ø> (ø)
kind-e2e-tests      47.11% <ø> (-0.74%) ⬇️
unit-tests          51.03% <ø> (-0.01%) ⬇️

Impacted Files                                          Coverage Δ
pkg/agent/memberlist/cluster.go                         73.46% <ø> (-1.69%) ⬇️
pkg/agent/flowexporter/exporter/certificate.go          27.77% <0.00%> (-22.23%) ⬇️
...nt/apiserver/handlers/serviceexternalip/handler.go   29.62% <0.00%> (-22.23%) ⬇️
pkg/ipfix/ipfix_process.go                              81.25% <0.00%> (-18.75%) ⬇️
pkg/agent/controller/networkpolicy/packetin.go          61.48% <0.00%> (-15.55%) ⬇️
...g/agent/controller/serviceexternalip/controller.go   68.60% <0.00%> (-12.97%) ⬇️
pkg/agent/controller/networkpolicy/reject.go            66.00% <0.00%> (-9.86%) ⬇️
pkg/controller/networkpolicy/tier.go                    53.84% <0.00%> (-4.62%) ⬇️
pkg/agent/cniserver/ipam/antrea_ipam.go                 75.75% <0.00%> (-4.33%) ⬇️
pkg/agent/cniserver/server.go                           74.94% <0.00%> (-3.44%) ⬇️
... and 26 more

Member

@tnqn tnqn left a comment


LGTM

@tnqn tnqn added action/backport Indicates a PR that requires backports. action/release-note Indicates a PR that should be included in release notes. labels Dec 13, 2022
@tnqn
Member

tnqn commented Dec 13, 2022

/test-all

@tnqn tnqn requested review from wenqiq and xliuxu December 13, 2022 07:31
Contributor

@xliuxu xliuxu left a comment


LGTM. If I understand correctly, this issue can also be addressed by starting a goroutine to handle the events from c.nodeEventsCh before calling c.mList.Join(members) in the Cluster.Run() function.
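
For illustration, a rough sketch of that alternative, reusing the simplified cluster type from the sketch under the PR description; this is not the actual Antrea code, and stopCh is a hypothetical stop channel:

```go
// Alternative sketched above: start draining nodeEventsCh before taking the node
// lock (which stands in for Memberlist.Join()), so the notifier can never block.
func (c *cluster) runWithEarlyConsumer(stopCh <-chan struct{}) {
	go func() {
		for {
			select {
			case <-c.nodeEventsCh:
				// handle the Node join/leave event ...
			case <-stopCh:
				return
			}
		}
	}()

	c.nodeLock.Lock() // safe now: the consumer above keeps the channel draining
	// ... send join messages for all existing Nodes ...
	c.nodeLock.Unlock()
}
```

As the author notes below, this avoids the blockage but still announces the existing Nodes twice (once here and once from the NodeAdd handler), which is why the patch removes the call instead.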

@wenqiq
Contributor

wenqiq commented Dec 13, 2022

Good job and excellent debugging. The "Memberlist.Join()" in "Cluster.Run()" ensures that all Nodes join the memberlist. I have a question here: when you deploy Antrea in an existing cluster, will handleCreateNode still be executed?

@wenyingd
Contributor Author

wenyingd commented Dec 13, 2022

I have a question here: when you deploy Antrea in an existing cluster, will handleCreateNode still be executed?

Yes, it is still called by the NodeInformer when a "NodeAdd" event is observed.

Contributor

@wenqiq wenqiq left a comment


LGTM

@wenyingd
Contributor Author

LGTM. If I understand correctly, this issue can also be addressed by starting a goroutine to handle the events from c.nodeEventsCh before calling c.mList.Join(members) in the Cluster.Run() function.

Although starting a goroutine to handle the events from c.nodeEventsCh before calling c.mList.Join(members) could avoid the event channel blocking, it would lead to calling "c.mList.Join" twice for each existing Node: once in Run, and once from the NodeAdd event handler. The join message would be duplicated on the cluster network, and other Agents might receive and process it twice.

@xliuxu
Contributor

xliuxu commented Dec 13, 2022

LGTM. If I understand correctly, this issue can also be addressed by starting a goroutine to handle the events from c.nodeEventsCh before calling c.mList.Join(members) in the Cluster.Run() function.

Although starting a goroutine to handle the events from c.nodeEventsCh before calling c.mList.Join(members) could avoid the event channel blocking, it would lead to calling "c.mList.Join" twice for each existing Node: once in Run, and once from the NodeAdd event handler. The join message would be duplicated on the cluster network, and other Agents might receive and process it twice.

Agreed. Removing unnecessary calls should be a better approach.

@tnqn tnqn merged commit eb47df9 into antrea-io:main Dec 13, 2022
@tnqn
Member

tnqn commented Dec 13, 2022

@wenyingd please backport the fix.

Labels
action/backport Indicates a PR that requires backports. action/release-note Indicates a PR that should be included in release notes. kind/bug Categorizes issue or PR as related to a bug.
4 participants