nfd-master: tweak list options for NodeFeature informer #1811

marquiz · 2024-07-25T13:44:24Z

Fix cache syncing problems on big clusters with thousands of NodeFeature objects.

On the initial list (sync), the client-go cache reflector sets the ResourceVersion to "0" (instead of leaving it empty). This causes problems in the apiserver, which logs errors like:

E writers.go:122] apiserver was unable to write a JSON response: http:
                  Handler timeout
E status.go:71] apiserver received an error that is not an
                metav1.Status: &errors.errorString{s:"http: Handler timeout"}:
                http: Handler timeout

On the nfd-master side we see corresponding log snippets like:

W reflector.go:547] failed to list *v1alpha1.NodeFeature: stream error
                    when reading response body, may be caused by closed
                    connection. Please retry. Original error: stream
                    error: stream ID 1521; INTERNAL_ERROR; received from
                    peer
I trace.go:236] "Reflector ListAndWatch" name:*** (***) (total time:
                61126ms): ---"Objects listed" error:stream error when
                reading response body, may be caused by closed
                connection. Please retry. Original error: stream
                error: stream ID 1521; INTERNAL_ERROR; received from
                peer 61126ms (***)

Decreasing the page size (opts.Limit) does not have any effect on the timeouts. However, setting ResourceVersion to an empty value seems to get the paging back on track, eliminating the timeouts.

TODO: investigate in Kubernetes upstream the root cause of the timeouts with ResourceVersion="0".
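
For illustration, the workaround described above can be sketched as a small list-options tweak. This is a minimal sketch, not the code of this PR: `ListOptions` is a hypothetical stand-in for `metav1.ListOptions` so that the example compiles without the Kubernetes modules, and the exact condition (`== "0"`) is an assumption based on the description.

```go
package main

import "fmt"

// ListOptions is a hypothetical, trimmed-down stand-in for
// metav1.ListOptions (assumption: only these two fields matter here).
type ListOptions struct {
	ResourceVersion string
	Limit           int64
}

// tweakListOpts clears the "0" resource version that the reflector sets
// on the initial list and leaves every other value untouched.
func tweakListOpts(opts *ListOptions) {
	if opts.ResourceVersion == "0" {
		opts.ResourceVersion = ""
	}
}

func main() {
	// Options as they might look on the initial list (sync).
	initial := ListOptions{ResourceVersion: "0", Limit: 500}
	tweakListOpts(&initial)
	fmt.Printf("initial list: rv=%q limit=%d\n", initial.ResourceVersion, initial.Limit)
}
```

The page size is left alone on purpose, since changing it reportedly had no effect on the timeouts.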

k8s-ci-robot · 2024-07-25T13:44:39Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: marquiz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [marquiz]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

netlify · 2024-07-25T13:44:42Z

✅ Deploy Preview for kubernetes-sigs-nfd ready!

Name	Link
🔨 Latest commit	`a2068f7`
🔍 Latest deploy log	https://app.netlify.com/sites/kubernetes-sigs-nfd/deploys/66a256bc551f08000854ee34
😎 Deploy Preview	https://deploy-preview-1811--kubernetes-sigs-nfd.netlify.app

To edit notification comments on pull requests, go to your Netlify site configuration.

marquiz · 2024-07-25T13:45:32Z

/assign @ArangoGutierrez

/cc @ahmetb @lxlxok

k8s-ci-robot · 2024-07-25T13:45:36Z

@marquiz: GitHub didn't allow me to request PR reviews from the following users: lxlxok.

Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/assign @ArangoGutierrez

/cc @ahmetb @lxlxok

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

ahmetb · 2024-07-25T16:49:13Z

pkg/nfd-master/nfd-api-controller.go

+			// Tweak list opts on initial sync to avoid timeouts on the apiserver.
+			// NodeFeature objects are huge and the Kubernetes apiserver
+			// (v1.30) experiences http handler timeouts when the resource
+			// version is set to some non-empty value (TODO: find out why).


I assume

https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/1904-efficient-watch-resumption/README.md

Watch with resourceVersion="" can take down control plane kubernetes/kubernetes#123448

This shouldn't happen, as we only set rv="" on the initial LIST operation. When setting up the watches, the RV is always set to a value greater than 0.
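
A rough sketch of that reasoning, under stated assumptions: the same tweak function sees both the LIST and the WATCH options, but only the initial LIST carries "0". `ListOptions` and `syncRequests` are hypothetical names invented for this illustration; they only mimic the order of requests a reflector-like client would issue and are not client-go APIs.

```go
package main

import "fmt"

// ListOptions is a hypothetical stand-in for metav1.ListOptions.
type ListOptions struct {
	ResourceVersion string
	Watch           bool
}

// tweakListOpts clears only the "0" resource version.
func tweakListOpts(opts *ListOptions) {
	if opts.ResourceVersion == "0" {
		opts.ResourceVersion = ""
	}
}

// syncRequests (hypothetical helper) builds the options for an initial
// LIST followed by a WATCH that resumes from the resource version the
// list returned, and runs the tweak on both.
func syncRequests(listedRV string) []ListOptions {
	reqs := []ListOptions{
		{ResourceVersion: "0"},                   // initial LIST
		{ResourceVersion: listedRV, Watch: true}, // WATCH from the listed RV
	}
	for i := range reqs {
		tweakListOpts(&reqs[i])
	}
	return reqs
}

func main() {
	for _, r := range syncRequests("12345") {
		fmt.Printf("watch=%t rv=%q\n", r.Watch, r.ResourceVersion)
	}
}
```

In this sketch the WATCH keeps its non-zero resource version, so the tweak never produces a watch with an empty resource version.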

ahmetb · 2024-07-25T16:49:42Z

pkg/nfd-master/nfd-api-controller.go

+				opts.ResourceVersion = ""
+			}
+		}
+		featureInformer := nfdinformersv1alpha1.New(informerFactory, "", tweakListOpts).NodeFeatures()


Q: doesn't nfd-gc need this tweak too?

marquiz · 2024-07-26

Indeed, it's needed there! I will update.

marquiz · 2024-07-26

Thanks @ahmetb for the note. #1815 is the outcome of this.

PiotrProkop

/lgtm

k8s-ci-robot · 2024-07-30T10:25:23Z

LGTM label has been added.

Git tree hash: 4f80c9891f1e6a3d5e5ed26d824008a0b7ed0ad3

k8s-ci-robot added the "cncf-cla: yes" label (indicates the PR's author has signed the CNCF CLA) Jul 25, 2024

k8s-ci-robot requested review from jjacobelli and PiotrProkop July 25, 2024 13:44

k8s-ci-robot added the "approved" (indicates a PR has been approved by an approver from all required OWNERS files) and "size/S" (denotes a PR that changes 10-29 lines, ignoring generated files) labels Jul 25, 2024

k8s-ci-robot assigned ArangoGutierrez Jul 25, 2024

k8s-ci-robot requested a review from ahmetb July 25, 2024 13:45

marquiz mentioned this pull request Jul 25, 2024

[v0.16.0] All node labels removed while informer cache failed to sync #1802

Open

ahmetb reviewed Jul 25, 2024

PiotrProkop reviewed Jul 30, 2024

k8s-ci-robot assigned PiotrProkop Jul 30, 2024

k8s-ci-robot added the "lgtm" label ("Looks good to me", indicates that a PR is ready to be merged) Jul 30, 2024

k8s-ci-robot merged commit 2d24a4b into kubernetes-sigs:master Jul 30, 2024
10 checks passed

marquiz deleted the devel/informer-listopts branch July 30, 2024 12:51
