Automated cherry pick of #4469: Bugfix: Resolve a deadlock in cluster memberlist maintanance #4472

wenyingd

Cherry pick of #4469 on release-1.8.

#4469: Bugfix: Resolve a deadlock in cluster memberlist maintanance

For details on the cherry pick process, see the cherry pick requests page.

Several Antrea Agents ran out of memory in a large-scale cluster. The memory of
each failed Agent grew continuously, from 400MB to 1.8GB in less than 24 hours.

After profiling Agent memory and call stacks, we found that most of the memory was
held by Node resources received by the Node informer's watch function. The goroutine
dump revealed a deadlock:
1. "Cluster.Run()" is stuck calling "Memberlist.Join()", which is blocked waiting to
   acquire "Memberlist.nodeLock".
2. Memberlist has received a Node Join/Leave message from another Agent and holds
   "Memberlist.nodeLock". It is blocked sending a message to "Cluster.nodeEventsCh",
   whose consumer is blocked as well.

The issue may occur in a large-scale setup. Although "Cluster.nodeEventsCh" has a
buffer of 1024 messages, a cluster with many Nodes can fill the channel before the
Agent finishes sending the Member join message to the existing Nodes.

To resolve the issue, this patch removes the unnecessary call to "Memberlist.Join()"
in "Cluster.Run()", since Join is already called on the "NodeAdd" event triggered by
the NodeInformer.
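
The lock-plus-full-channel pattern described above can be reproduced in a few
lines of Go. This is only a minimal sketch under stated assumptions: the type
`gossip` and the functions `notifyJoin`, `join`, and `stuck` are hypothetical
stand-ins and not code from Antrea or the memberlist library, the buffer sizes
are tiny instead of 1024, and a short sleep is used to let the producer run
before the consumer starts.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// gossip is a hypothetical stand-in for the memberlist: nodeLock guards its
// node state, and events plays the role of the buffered node-event channel.
type gossip struct {
	nodeLock sync.Mutex
	events   chan string
}

// notifyJoin mimics the receive path: it holds nodeLock while pushing an
// event, so a full channel leaves the lock held until someone drains it.
func (g *gossip) notifyJoin(node string) {
	g.nodeLock.Lock()
	defer g.nodeLock.Unlock()
	g.events <- node
}

// join mimics a Join() call made by the event consumer: it needs nodeLock.
func (g *gossip) join() {
	g.nodeLock.Lock()
	defer g.nodeLock.Unlock()
}

// stuck reports whether the consumer fails to finish within a fixed timeout.
// joinFirst selects whether the consumer calls join before draining events.
func stuck(bufSize, peers int, joinFirst bool) bool {
	g := &gossip{events: make(chan string, bufSize)}

	// Producer: one event per peer, each sent while holding nodeLock.
	go func() {
		for i := 0; i < peers; i++ {
			g.notifyJoin(fmt.Sprintf("node-%d", i))
		}
	}()

	done := make(chan struct{})
	go func() {
		defer close(done)
		time.Sleep(50 * time.Millisecond) // assumption: lets the producer fill the buffer first
		if joinFirst {
			g.join() // waits forever if the producer is parked holding nodeLock
		}
		for i := 0; i < peers; i++ {
			<-g.events
		}
	}()

	select {
	case <-done:
		return false
	case <-time.After(500 * time.Millisecond):
		return true // blocked goroutines are leaked; acceptable for a demo
	}
}

func main() {
	fmt.Println("small buffer, join before drain:", stuck(2, 5, true))
	fmt.Println("small buffer, drain only:", stuck(2, 5, false))
	fmt.Println("large buffer, join before drain:", stuck(8, 5, true))
}
```

Only the first case should report a stall: the producer blocks on the full
channel while holding the lock, and the consumer waits for that lock before it
ever drains the channel. The second case corresponds to the approach taken in
this patch, where the consumer no longer calls join on that path.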

Signed-off-by: wenyingd <wenyingd@vmware.com>
@wenyingd wenyingd added the kind/cherry-pick Categorizes issue or PR as related to the cherry-pick of a bug fix from the main branch to a release label Dec 13, 2022
@wenyingd

/test-all

@codecov

codecov bot commented Dec 13, 2022

Codecov Report

Merging #4472 (93645a1) into release-1.8 (c891945) will increase coverage by 3.41%.
The diff coverage is n/a.

@@               Coverage Diff               @@
##           release-1.8    #4472      +/-   ##
===============================================
+ Coverage        62.57%   65.98%   +3.41%     
===============================================
  Files              304      304              
  Lines            46634    46614      -20     
===============================================
+ Hits             29180    30758    +1578     
+ Misses           15146    13460    -1686     
- Partials          2308     2396      +88     

| Flag | Coverage Δ |
|---|---|
| integration-tests | 35.10% <ø> (-0.13%) ⬇️ |
| kind-e2e-tests | 48.44% <ø> (+6.78%) ⬆️ |
| unit-tests | 45.44% <ø> (+0.46%) ⬆️ |

| Impacted Files | Coverage Δ |
|---|---|
| pkg/agent/memberlist/cluster.go | 76.45% <ø> (+5.20%) ⬆️ |
| pkg/controller/networkpolicy/tier.go | 50.00% <0.00%> (-5.00%) ⬇️ |
| ...lticluster/commonarea/resourceimport_controller.go | 74.45% <0.00%> (-3.29%) ⬇️ |
| pkg/agent/cniserver/ipam/antrea_ipam.go | 51.08% <0.00%> (-1.74%) ⬇️ |
| ...trollers/multicluster/resourceexport_controller.go | 78.68% <0.00%> (-1.64%) ⬇️ |
| ...r/ipseccertificate/ipsec_certificate_controller.go | 61.56% <0.00%> (-0.98%) ⬇️ |
| ...g/agent/cniserver/interface_configuration_linux.go | 26.81% <0.00%> (-0.97%) ⬇️ |
| pkg/ovs/ovsconfig/ovs_client.go | 65.66% <0.00%> (+0.12%) ⬆️ |
| pkg/ipam/ipallocator/allocator.go | 88.14% <0.00%> (+0.51%) ⬆️ |
| pkg/controller/ipam/validate.go | 80.32% <0.00%> (+0.54%) ⬆️ |

... and 61 more

@tnqn tnqn merged commit 45b2090 into antrea-io:release-1.8 Dec 13, 2022
@wenyingd wenyingd deleted the automated-cherry-pick-of-#4469-upstream-release-1.8 branch May 30, 2023 06:51