Fix Service update processing #4845

Merged
tnqn merged 1 commit into antrea-io:main from reduce-antrea-proxy-redundance
Apr 19, 2023

Member

@tnqn tnqn commented Apr 11, 2023

Currently the installServices method is long and redundant, which makes it hard to maintain and error-prone. In fact, a few bugs found in recent releases are related to it to some degree. While sorting out the code, I found more bugs in it:

  1. After updating stickyMaxAgeSeconds, the flow for ClusterIP isn't updated because the installServiceFlows interface skips updating flows whose cache keys already exist.
  2. After updating stickyMaxAgeSeconds, the flows for NodePort and LoadBalancerIPs aren't updated because the installServiceFlows interface is not even called.
  3. After updating InternalTrafficPolicy, the flows for ClusterIP aren't updated.
  4. After updating InternalTrafficPolicy for a ClusterIP Service, the stale group is not removed.
  5. Endpoints are installed repeatedly even though there are already reference counters for them.

This patch refactors the method to make it easier to understand and maintain, and fixes all of the above bugs. It makes the following changes:

  1. Code redundancy is reduced by extracting shareable sub-procedures into sub-functions.
  2. Calculation of variables that are required by only one sub-procedure is moved to the corresponding sub-function.
  3. Repeated code that retrieves the group IDs is removed.
  4. The ways of processing ClusterIP, NodePort, and LoadBalancerIPs are unified.
  5. A method for installing Endpoints in the same way as uninstalling Endpoints is added.
  6. needUpdateService indicates that all flows of the Service need to be updated, and needUpdateServiceExternalAddresses indicates that only the flows related to external addresses need to be updated.
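
The split between the two flags in the last item can be sketched as follows. This is only an illustration under assumptions, not the code in proxier.go: the `serviceInfo` struct, `equalStrings`, and `computeUpdateFlags` are hypothetical stand-ins, and stickyMaxAgeSeconds is used as a placeholder for any attribute that affects every flow. Only the two flag names and the NodePort/LoadBalancer IP comparison come from this PR.

```go
package main

import "fmt"

// serviceInfo is a hypothetical, simplified stand-in for types.ServiceInfo.
type serviceInfo struct {
	stickyMaxAgeSeconds int
	nodePort            int
	loadBalancerIPs     []string
}

// equalStrings reports whether two string slices hold the same elements in order.
func equalStrings(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// externalAddressesChanged follows the idea of serviceExternalAddressesChanged
// in this PR: only the NodePort and the LoadBalancer IPs are compared.
func externalAddressesChanged(cur, prev *serviceInfo) bool {
	return cur.nodePort != prev.nodePort ||
		!equalStrings(cur.loadBalancerIPs, prev.loadBalancerIPs)
}

// computeUpdateFlags is a hypothetical helper that derives both flags.
func computeUpdateFlags(cur, prev *serviceInfo) (needUpdateService, needUpdateServiceExternalAddresses bool) {
	if prev == nil {
		// Assumption: a Service that was never installed needs all of its flows.
		return true, false
	}
	// Assumption: stickyMaxAgeSeconds stands in for attributes that affect every flow.
	needUpdateService = cur.stickyMaxAgeSeconds != prev.stickyMaxAgeSeconds
	needUpdateServiceExternalAddresses = externalAddressesChanged(cur, prev)
	return needUpdateService, needUpdateServiceExternalAddresses
}

func main() {
	prev := &serviceInfo{stickyMaxAgeSeconds: 100, nodePort: 30000}
	cur := &serviceInfo{stickyMaxAgeSeconds: 100, nodePort: 30001}
	all, ext := computeUpdateFlags(cur, prev)
	fmt.Println("needUpdateService:", all)
	fmt.Println("needUpdateServiceExternalAddresses:", ext)
}
```

With a NodePort-only change, as in the example, only the external-address flag would be set, so the ClusterIP flows could be left untouched.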

@tnqn tnqn added the area/proxy, action/backport, and action/release-note labels Apr 11, 2023
@tnqn tnqn added this to the Antrea v1.12 release milestone Apr 11, 2023
Member Author

tnqn commented Apr 11, 2023

/test-all

Contributor

@antoninbas antoninbas left a comment

In the PR description, you list a few scenarios that were not supported correctly. For example: "After updating InternalTrafficPolicy, the flows for ClusterIP isn't updated."
Should we add unit test coverage for these scenarios?

if exists && !needUpdate {
	return groupID, true
}
succeed := false
Contributor

nit: s/succeed/success

Member Author

updated

}

func serviceExternalAddressesChanged(svcInfo, pSvcInfo *types.ServiceInfo) bool {
return svcInfo.NodePort() != pSvcInfo.NodePort() || !reflect.DeepEqual(svcInfo.LoadBalancerIPStrings(), pSvcInfo.LoadBalancerIPStrings())
Contributor

nit: you can use https://pkg.go.dev/golang.org/x/exp/slices#Equal now instead of DeepEqual

Member Author

Thanks for the pointer. I see we have no other dependency on this package. Given that it's declared experimental and unreliable, I used "k8s.io/utils/strings/slices" instead, which has almost exactly the same code.
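
One behavioral difference worth keeping in mind when swapping the comparison: reflect.DeepEqual treats a nil slice and an empty non-nil slice as different, while a length-then-elements comparison treats them as equal. The sketch below uses a hand-written `equalStrings` helper, which is hypothetical and written here only so the example has no external dependency; it is not the library function itself.

```go
package main

import (
	"fmt"
	"reflect"
)

// equalStrings compares two string slices by length and then element by element.
// It is a hypothetical helper used only for this illustration.
func equalStrings(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// deepEqualStrings wraps reflect.DeepEqual so both approaches share a signature.
func deepEqualStrings(a, b []string) bool {
	return reflect.DeepEqual(a, b)
}

func main() {
	var none []string    // nil slice
	empty := []string{}  // empty, non-nil slice
	fmt.Println("DeepEqual:   ", deepEqualStrings(none, empty))
	fmt.Println("element-wise:", equalStrings(none, empty))
}
```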

@tnqn tnqn force-pushed the reduce-antrea-proxy-redundance branch from 8f10002 to 1392758 Compare April 12, 2023 03:23
Member Author

@tnqn tnqn left a comment

> In the PR description, you list a few scenarios that were not supported correctly. For example: "After updating InternalTrafficPolicy, the flows for ClusterIP isn't updated."
> Should we add unit test coverage for these scenarios?

@antoninbas all of these scenarios were already covered by unit tests, but the expectations in those tests were also wrong. I have added comments explaining how the changes to the test code validate the fixes.

Comment on lines -1966 to -1967
mockOFClient.EXPECT().UninstallServiceFlows(svcIP, uint16(svcPort), bindingProtocol).Times(1)
mockOFClient.EXPECT().InstallServiceFlows(groupID, svcIP, uint16(svcPort), bindingProtocol, uint16(0), false, corev1.ServiceTypeClusterIP, false).Times(1)
Member Author

The removal of these lines validates that there is no need to reinstall ClusterIP flows when only NodePort changes.

Comment on lines -1982 to -1986

mockOFClient.EXPECT().UninstallServiceFlows(loadBalancerIP, uint16(svcPort), bindingProtocol)
mockOFClient.EXPECT().InstallServiceFlows(groupID, loadBalancerIP, uint16(svcPort), bindingProtocol, uint16(0), false, corev1.ServiceTypeLoadBalancer, false).Times(1)
mockRouteClient.EXPECT().DeleteLoadBalancer(loadBalancerIP).Times(1)
mockRouteClient.EXPECT().AddLoadBalancer(loadBalancerIP).Times(1)
Member Author

The removal of these lines validates that there is no need to reinstall LoadBalancerIP flows when only NodePort changes.

@@ -2082,7 +2072,6 @@ func testServiceExternalTrafficPolicyUpdate(t *testing.T,
mockOFClient.EXPECT().InstallServiceFlows(groupID, svcIP, uint16(svcPort), bindingProtocol, uint16(0), true, corev1.ServiceTypeClusterIP, false).Times(1)

if svcType == corev1.ServiceTypeNodePort || svcType == corev1.ServiceTypeLoadBalancer {
mockOFClient.EXPECT().InstallEndpointFlows(bindingProtocol, gomock.InAnyOrder(expectedAllEps)).Times(1)
Member Author

The removal of this line validates that the 5th bug, "Endpoints are installed repeatedly", is fixed.
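
The reference-counting idea behind this fix can be sketched as below. The `endpointTracker` type and its methods are hypothetical and are not Antrea's implementation; the sketch only shows how a counter lets a caller install flows when an Endpoint gains its first reference and uninstall them when it loses its last one.

```go
package main

import "fmt"

// endpointTracker is a hypothetical reference counter keyed by Endpoint string.
type endpointTracker struct {
	refs       map[string]int
	installs   int // how many times flows would have been installed
	uninstalls int // how many times flows would have been uninstalled
}

func newEndpointTracker() *endpointTracker {
	return &endpointTracker{refs: map[string]int{}}
}

// addReference increments the counter and reports whether this was the
// first reference, i.e. whether flows actually need to be installed.
func (t *endpointTracker) addReference(ep string) bool {
	t.refs[ep]++
	if t.refs[ep] == 1 {
		t.installs++
		return true
	}
	return false
}

// removeReference decrements the counter and reports whether this was the
// last reference, i.e. whether flows actually need to be uninstalled.
func (t *endpointTracker) removeReference(ep string) bool {
	n, ok := t.refs[ep]
	if !ok {
		return false
	}
	if n == 1 {
		delete(t.refs, ep)
		t.uninstalls++
		return true
	}
	t.refs[ep] = n - 1
	return false
}

func main() {
	t := newEndpointTracker()
	// Two Services share the same Endpoint; flows are installed only once.
	fmt.Println(t.addReference("10.0.0.5:80"))
	fmt.Println(t.addReference("10.0.0.5:80"))
	fmt.Println("installs:", t.installs)
}
```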

mockOFClient.EXPECT().UninstallEndpointFlows(bindingProtocol, expectedRemoteEps).Times(1)
mockOFClient.EXPECT().UninstallServiceGroup(groupID).Times(1)
Member Author

This line validates that the 4th bug, "After updating InternalTrafficPolicy for a ClusterIP Service, the stale group is not removed", is fixed.

mockOFClient.EXPECT().InstallServiceGroup(groupIDLocal, false, expectedLocalEps).Times(1)
mockOFClient.EXPECT().InstallServiceFlows(groupIDLocal, svcIP, uint16(svcPort), bindingProtocol, uint16(0), false, corev1.ServiceTypeClusterIP, false).Times(1)
Member Author

The uninstall and install validate that the 3rd bug, "After updating InternalTrafficPolicy, the flows for ClusterIP isn't updated", is fixed.
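
The calls expected by the test after an InternalTrafficPolicy change can be pictured with a small recorder. Everything here is a sketch under assumptions: `fakeClient` and `switchGroup` are hypothetical, the call names only echo the mock expectations quoted in this review, and the ordering shown is an assumption rather than the order used in proxier.go.

```go
package main

import "fmt"

// fakeClient records calls; it is a hypothetical stand-in for the mocked
// OpenFlow client used in the unit tests.
type fakeClient struct {
	calls []string
}

func (c *fakeClient) record(format string, args ...interface{}) {
	c.calls = append(c.calls, fmt.Sprintf(format, args...))
}

// switchGroup sketches moving a ClusterIP Service from one group to another:
// the Service flows are reinstalled against the new group, and the stale
// group is removed so that it does not linger.
func switchGroup(c *fakeClient, svcIP string, oldGroup, newGroup uint32) {
	c.record("UninstallServiceFlows(%s)", svcIP)
	c.record("InstallServiceGroup(%d)", newGroup)
	c.record("InstallServiceFlows(%d, %s)", newGroup, svcIP)
	c.record("UninstallServiceGroup(%d)", oldGroup)
}

func main() {
	c := &fakeClient{}
	switchGroup(c, "10.96.0.1", 1, 2)
	for _, call := range c.calls {
		fmt.Println(call)
	}
}
```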

@@ -2344,16 +2335,24 @@ func testServiceStickyMaxAgeSecondsUpdate(t *testing.T,
mockOFClient.EXPECT().InstallEndpointFlows(bindingProtocol, expectedEps).Times(1)
mockOFClient.EXPECT().InstallServiceGroup(groupID, true, expectedEps).Times(1)
mockOFClient.EXPECT().InstallServiceFlows(groupID, svcIP, uint16(svcPort), bindingProtocol, uint16(affinitySeconds), false, corev1.ServiceTypeClusterIP, false).Times(1)

mockOFClient.EXPECT().UninstallServiceFlows(svcIP, uint16(svcPort), bindingProtocol).Times(1)
Member Author

The uninstall and install validate that the 1st bug, "After updating stickyMaxAgeSeconds, the flow for ClusterIP isn't updated", is fixed.
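
The first bug follows a common caching pitfall, sketched below under assumptions: `flowCache`, `installIfAbsent`, and `replace` are hypothetical names, not Antrea's API. An install path that returns early when the cache key already exists silently drops an updated value, which is why the corrected test expects an uninstall followed by an install.

```go
package main

import "fmt"

// flowCache is a hypothetical cache mapping a Service key to the affinity
// timeout carried by its installed flow.
type flowCache struct {
	timeouts map[string]uint16
}

// installIfAbsent mimics the buggy behavior: if the key is already cached,
// the new timeout is ignored and nothing is updated.
func (c *flowCache) installIfAbsent(key string, timeout uint16) bool {
	if _, exists := c.timeouts[key]; exists {
		return false
	}
	c.timeouts[key] = timeout
	return true
}

// replace mimics the fixed behavior: remove the old entry, then install
// the new one, so the updated timeout always takes effect.
func (c *flowCache) replace(key string, timeout uint16) {
	delete(c.timeouts, key)  // uninstall
	c.timeouts[key] = timeout // install
}

func main() {
	c := &flowCache{timeouts: map[string]uint16{}}
	key := "10.96.0.1:80/TCP"
	c.installIfAbsent(key, 100)
	c.installIfAbsent(key, 200) // skipped: key already exists
	fmt.Println("after skip-if-exists:", c.timeouts[key])
	c.replace(key, 200)
	fmt.Println("after replace:", c.timeouts[key])
}
```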

Comment on lines +2344 to +2347
mockOFClient.EXPECT().UninstallServiceFlows(vIP, uint16(svcNodePort), bindingProtocol).Times(1)
mockRouteClient.EXPECT().DeleteNodePort(nodePortAddresses, uint16(svcNodePort), bindingProtocol).Times(1)
mockOFClient.EXPECT().InstallServiceFlows(groupID, vIP, uint16(svcNodePort), bindingProtocol, uint16(updatedAffinitySeconds), false, corev1.ServiceTypeNodePort, false).Times(1)
mockRouteClient.EXPECT().AddNodePort(nodePortAddresses, uint16(svcNodePort), bindingProtocol).Times(1)
Member Author

The uninstall and install validate that the 2nd bug, "After updating stickyMaxAgeSeconds, the flows for NodePort aren't updated", is fixed.

Comment on lines +2352 to +2355
mockOFClient.EXPECT().UninstallServiceFlows(loadBalancerIP, uint16(svcPort), bindingProtocol).Times(1)
mockRouteClient.EXPECT().DeleteLoadBalancer(loadBalancerIP).Times(1)
mockOFClient.EXPECT().InstallServiceFlows(groupID, loadBalancerIP, uint16(svcPort), bindingProtocol, uint16(updatedAffinitySeconds), false, corev1.ServiceTypeLoadBalancer, false).Times(1)
mockRouteClient.EXPECT().AddLoadBalancer(loadBalancerIP).Times(1)
Member Author

The uninstall and install validate that the 2nd bug, "After updating stickyMaxAgeSeconds, the flows for LoadBalancerIPs aren't updated", is fixed.

@@ -410,11 +464,18 @@ func (p *proxier) installServices() {

installedSvcPort, ok := p.serviceInstalledMap[svcPortName]
var pSvcInfo *types.ServiceInfo
- var needRemoval, needUpdateService, needUpdateEndpoints bool
+ var needUpdateServiceExternalAddresses, needUpdateService, needUpdateEndpoints bool
Contributor

Do we need more flags like these since updating some attributes of a Service will not affect all flows and configurations of a Service?

Member Author

It may be possible to be more fine-grained. However, I don't think it's worth introducing a lot of special-case processing just to save a few calls for some infrequent operations. Unless it can be done without adding much complexity, I feel it's unnecessary.

@tnqn tnqn force-pushed the reduce-antrea-proxy-redundance branch from 1392758 to 8a0fcfb Compare April 12, 2023 06:57
hongliangl previously approved these changes Apr 13, 2023
Contributor

@hongliangl hongliangl left a comment

LGTM

Currently the installServices method is long and redundant, which makes
it hard to maintain and error-prone. In fact, a few bugs found in recent
releases are related to it to some degree. While sorting out the code, I
found more bugs in it:

1. After updating stickyMaxAgeSeconds, the flow for ClusterIP isn't
   updated because the installServiceFlows interface skips updating
   flows whose cache keys already exist.
2. After updating stickyMaxAgeSeconds, the flows for NodePort and
   LoadBalancerIPs aren't updated because the installServiceFlows
   interface is not even called.
3. After updating InternalTrafficPolicy, the flows for ClusterIP aren't
   updated.
4. After updating InternalTrafficPolicy for a ClusterIP Service, the
   stale group is not removed.
5. Endpoints are installed repeatedly even though there are already
   reference counters for them.

This patch refactors the method to make it easier to understand and
maintain, and fixes all of the above bugs. It makes the following
changes:

1. Code redundancy is reduced by extracting shareable sub-procedures
   into sub-functions.
2. Calculation of variables that are required by only one sub-procedure
   is moved to the corresponding sub-function.
3. Repeated code that retrieves the group IDs is removed.
4. The ways of processing ClusterIP, NodePort, and LoadBalancerIPs are
   unified.
5. A method for installing Endpoints in the same way as uninstalling
   Endpoints is added.
6. needUpdateService indicates that all flows of the Service need to be
   updated, and needUpdateServiceExternalAddresses indicates that only
   the flows related to external addresses need to be updated.

Signed-off-by: Quan Tian <qtian@vmware.com>
Member Author

tnqn commented Apr 17, 2023

/test-all

Member Author

tnqn commented Apr 17, 2023

@antoninbas could you take another look at this one?

Member Author

tnqn commented Apr 18, 2023

/test-ipv6-e2e
/test-ipv6-conformance
/test-ipv6-only-e2e
/test-ipv6-only-conformance
/test-windows-all

Member Author

tnqn commented Apr 18, 2023

/test-windows-proxyall-e2e
/test-windows-containerd-e2e
/test-windows-containerd-conformance

@tnqn tnqn merged commit 8444b4c into antrea-io:main Apr 19, 2023
@tnqn tnqn deleted the reduce-antrea-proxy-redundance branch April 19, 2023 01:46
@luolanzone luolanzone mentioned this pull request Apr 19, 2023
jainpulkit22 pushed a commit to urharshitha/antrea that referenced this pull request Apr 28, 2023
ceclinux pushed a commit to ceclinux/antrea that referenced this pull request Jun 5, 2023
@tnqn tnqn added the kind/bug label Jan 23, 2024