Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix nil pointer issue when clusterset is deleted in leader #3915

Merged
merged 1 commit into from
Jun 21, 2022

Conversation

luolanzone
Copy link
Contributor

@luolanzone luolanzone commented Jun 20, 2022

When the ClusterSet is deleted in leader Namespace, there will be
a nil pointer issue due to data inconsistency. This issue will cause
leader controller crash and restart. Refined codes to make data
consistent.
```
leader_clusterset_controller.go:72] "Received ClusterSet delete" config=""
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x188b9b1]

goroutine 274 [running]:
antrea.io/antrea/multicluster/controllers/multicluster.(*MemberClusterAnnounceReconciler).processMCSStatus(0xc00052ff20)
        /antrea/multicluster/controllers/multicluster/memberclusterannounce_controller.go:181 +0x331
...
```

Fixes #3918

Signed-off-by: Lan Luo luola@vmware.com

@luolanzone
Copy link
Contributor Author

/test-multicluster-dataplane-e2e

@luolanzone luolanzone requested review from jianjuns and tnqn June 20, 2022 10:01
@codecov-commenter
Copy link

codecov-commenter commented Jun 20, 2022

Codecov Report

Merging #3915 (c7577f5) into main (52bb02e) will decrease coverage by 9.55%.
The diff coverage is 85.71%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3915      +/-   ##
==========================================
- Coverage   63.91%   54.35%   -9.56%     
==========================================
  Files         293      290       -3     
  Lines       43227    43166      -61     
==========================================
- Hits        27628    23463    -4165     
- Misses      13363    17780    +4417     
+ Partials     2236     1923     -313     
Flag Coverage Δ
kind-e2e-tests 31.92% <ø> (-18.14%) ⬇️
unit-tests 44.57% <85.71%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...llers/multicluster/member_clusterset_controller.go 9.73% <0.00%> (ø)
...s/multicluster/memberclusterannounce_controller.go 71.79% <93.75%> (+6.79%) ⬆️
...llers/multicluster/leader_clusterset_controller.go 63.76% <100.00%> (ø)
pkg/agent/nodeportlocal/k8s/annotations.go 0.00% <0.00%> (-100.00%) ⬇️
...g/agent/apiserver/handlers/featuregates/handler.go 4.54% <0.00%> (-77.28%) ⬇️
pkg/agent/client.go 0.00% <0.00%> (-76.75%) ⬇️
pkg/agent/nodeportlocal/k8s/npl_controller.go 0.00% <0.00%> (-61.13%) ⬇️
pkg/apis/controlplane/v1beta2/helper.go 40.00% <0.00%> (-60.00%) ⬇️
...agent/controller/traceflow/traceflow_controller.go 14.03% <0.00%> (-58.93%) ⬇️
pkg/agent/nodeportlocal/portcache/port_table.go 8.75% <0.00%> (-58.75%) ⬇️
... and 112 more

@luolanzone luolanzone added the area/multi-cluster Issues or PRs related to multi cluster. label Jun 20, 2022
Copy link
Contributor

@jianjuns jianjuns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In commit message, change "," to "." or ", and" between two sentences.

Please pay attention to such basic grammar mistakes (I often saw it in your PR and reminded you a few times for the same one).

@@ -232,10 +233,10 @@ func (r *MemberClusterAnnounceReconciler) processMCSStatus() {

/******************************* MemberClusterStatusManager methods *******************************/

func (r *MemberClusterAnnounceReconciler) AddMember(MemberId common.ClusterID) {
defer r.mapLock.Unlock()
func (r *MemberClusterAnnounceReconciler) AddMember(memberId common.ClusterID) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change memberId to memberID

@jianjuns
Copy link
Contributor

@luolanzone what will happen when the issue occurs? MC Controller will crash?

Please still create an issue to track it. We should do more tests to detect such basic issues.

@luolanzone
Copy link
Contributor Author

I will pay attention on it.

In commit message, change "," to "." or ", and" between two sentences.

Please pay attention to such basic grammar mistakes (I often saw it in your PR and reminded you a few times for the same one).

Sure, thanks for reminding. I missed this kind of issue, I will double check it next time.

@luolanzone
Copy link
Contributor Author

@jianjuns I have created a new issue #3918 for this PR, and I will add e2e test case for this issue later. Could you help to review again? thanks.

Conditions: conditions}

r.timerData[MemberId] = &timerData{connected: false, lastUpdateTime: time.Time{}}
r.timerData[memberID] = &timerData{connected: false, lastUpdateTime: time.Time{}}
klog.InfoS("new member is added", "member", memberID)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please capitalize logs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 276 to 278
if _, ok := r.timerData[memberID]; ok {
klog.InfoS("remove members", "member", memberID)
delete(r.timerData, memberID)
Copy link
Member

@tnqn tnqn Jun 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the previous code doesn't have data race. The issue was caused by data inconsistency and it's still there.
RemoveMember removed the item from r.memberStatus but not r.timerData, but processMCSStatus expects items existing in r.timerData also exist in r.memberStatus, so accessing r.memberStatus[member].Condition panic.

I think the right fix is removing the two return in this function. We should add unit test in this PR, not a separate PR. One of the points of tests is to test the situation that has the issue, and to prove this patch can fix it. I don't think an e2e test can be very helpful here, it's likely unable to reproduce the scenario.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, thanks.

if _, ok := r.timerData[MemberId]; ok {
delete(r.timerData, MemberId)
if _, ok := r.timerData[memberID]; ok {
klog.InfoS("remove members", "member", memberID)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"remove members" is not clear whether it's going to remove it or it has done it. Please move it to the bottom of the method and change the log to "Removed member" to indicate it has been done.
Perhaps change the log in AddMember to "Added member" to be consistent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -60,5 +59,8 @@ $GOPATH/bin/informer-gen \
--output-package "${ANTREA_PKG}/multicluster/pkg/client/informers" \
--go-header-file hack/boilerplate.go.txt

go get sigs.k8s.io/controller-tools/cmd/controller-gen@v0.9.0
Copy link
Member

@tnqn tnqn Jun 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest not to include unrelated changes in bug fix PRs, one PR should focus on one purpose, especially when it's a bug fix. Imageine how we handle it if this needs to be backported. The side effect of bumping libraries would be backported or more efforts would be spent on dividing the patches.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other changes have been moved to this PR #3919

@luolanzone luolanzone force-pushed the fix-nil-pointer branch 2 times, most recently from 3327c3a to 7bf0f27 Compare June 21, 2022 06:51

GetMemberClusterStatuses() []multiclusterv1alpha1.ClusterStatus
// This is for unit test only
GetTimerData() map[common.ClusterID]*timerData
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

had similar comments before, does test code really need this interface to access the data?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, yes, it's unnecessary, removed it.

@luolanzone
Copy link
Contributor Author

/test-multicluster-dataplane-e2e

@tnqn
Copy link
Member

tnqn commented Jun 21, 2022

s/due to data inconsistent/due to data inconsistency/

and perhaps remove the code line and stack trace from commit message which doesn't look neat in git log history, and is more suitable to be put in issue.

@luolanzone
Copy link
Contributor Author

s/due to data inconsistent/due to data inconsistency/

and perhaps remove the code line and stack trace from commit message which doesn't look neat in git log history, and is more suitable to be put in issue.

Done

@luolanzone
Copy link
Contributor Author

/test-multicluster-dataplane-e2e

1 similar comment
@luolanzone
Copy link
Contributor Author

/test-multicluster-dataplane-e2e

status := memberClusterAnnounceReconcilerUnderTest.GetMemberClusterStatuses()
expectedTimeData := memberClusterAnnounceReconcilerUnderTest.timerData
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not expected data but actual data.
and it should be timerData not timeData to be consistent.
either actualTimerData or just timerData, but better to add "actual" to both of "status" and "timerData" or none of them.

Copy link
Contributor Author

@luolanzone luolanzone Jun 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, changed to actualTimerData

When the ClusterSet is deleted in leader Namespace, there will be
a nil pointer issue due to data inconsistency. This issue will cause
leader controller crash and restart. Refined codes to make data
consistent.

Signed-off-by: Lan Luo <luola@vmware.com>
@luolanzone
Copy link
Contributor Author

/test-multicluster-dataplane-e2e

Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jianjuns
Copy link
Contributor

/skip-all
The change impacts only Multi-cluster Controller.

@jianjuns jianjuns merged commit b2c245c into antrea-io:main Jun 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/multi-cluster Issues or PRs related to multi cluster.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MC leader controller will crash when ClusterSet is deleted
4 participants