nfd-master: tweak list options for NodeFeature informer #1811
Conversation
Fix cache syncing problems on big clusters with thousands of NodeFeature objects.

On the initial list (sync), the client-go cache reflector sets the ResourceVersion to "0" instead of leaving it empty. This causes problems in the apiserver, with (apiserver) logs like:

```
E writers.go:122] apiserver was unable to write a JSON response: http: Handler timeout
E status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
```

On the nfd-master side we see corresponding log snippets like:

```
W reflector.go:547] failed to list *v1alpha1.NodeFeature: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 1521; INTERNAL_ERROR; received from peer
I trace.go:236] "Reflector ListAndWatch" name:*** (***) (total time: 61126ms):
    ---"Objects listed" error:stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 1521; INTERNAL_ERROR; received from peer 61126ms (***)
```

Decreasing the page size (opts.Limit) does not have any effect on the timeouts. However, setting ResourceVersion to an empty value seems to get the paging back on track, eliminating the timeouts.

TODO: investigate in Kubernetes upstream the root cause of the timeouts with ResourceVersion="0".
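To illustrate the mechanics outside of NFD's generated informers, here is a minimal sketch of the same workaround wired through the generic client-go shared informer factory; the package and function names (`informerutil`, `newTweakedFactory`) are illustrative, not code from this PR:

```go
package informerutil

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// newTweakedFactory returns a shared informer factory whose LIST requests
// drop the ResourceVersion="0" that the cache reflector sets on its initial
// sync, forcing a consistent (and properly paged) read instead.
func newTweakedFactory(client kubernetes.Interface, resync time.Duration) informers.SharedInformerFactory {
	tweakListOpts := func(opts *metav1.ListOptions) {
		if opts.ResourceVersion == "0" {
			opts.ResourceVersion = ""
		}
	}
	return informers.NewSharedInformerFactoryWithOptions(
		client, resync, informers.WithTweakListOptions(tweakListOpts))
}
```

Any informer obtained from such a factory then issues its initial LIST without the pinned "0" resource version; watches are unaffected, since the reflector sets a concrete RV (>0) for those.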
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: marquiz

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
✅ Deploy Preview for kubernetes-sigs-nfd ready!
/assign @ArangoGutierrez
@marquiz: GitHub didn't allow me to request PR reviews from the following users: lxlxok. Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
```go
// Tweak list opts on initial sync to avoid timeouts on the apiserver.
// NodeFeature objects are huge and the Kubernetes apiserver
// (v1.30) experiences http handler timeouts when the resource
// version is set to some non-empty value (TODO: find out why).
```
This shouldn't happen, as we basically only set rv="" on the initial LIST operation. When setting up the watches, the RV is always set to >0.
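(For background, as I understand the Kubernetes list semantics: ResourceVersion="0" allows the apiserver to serve the list from its watch cache, where limit/continue paging is not honored, so the whole multi-thousand-object list comes back in a single response; an empty ResourceVersion requires the most recent data and is served as a properly paged, consistent read. If that's right, it would also explain why lowering opts.Limit made no difference.)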
```go
	if opts.ResourceVersion == "0" {
		opts.ResourceVersion = ""
	}
}
featureInformer := nfdinformersv1alpha1.New(informerFactory, "", tweakListOpts).NodeFeatures()
```
Q: doesn't nfd-gc need this tweak too?
Indeed, needed there! I will update
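For reference, a hedged sketch of what the same tweak might look like on the nfd-gc side, assuming nfd-gc builds its NodeFeature informer through the same generated informer package (`gcInformerFactory` and `gcFeatureInformer` are illustrative names, not actual nfd-gc code):

```go
// Reuse the same workaround: clear the "0" resource version that the
// reflector sets on its initial LIST of NodeFeature objects.
tweakListOpts := func(opts *metav1.ListOptions) {
	if opts.ResourceVersion == "0" {
		opts.ResourceVersion = ""
	}
}
gcFeatureInformer := nfdinformersv1alpha1.New(gcInformerFactory, "", tweakListOpts).NodeFeatures()
```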
/lgtm
LGTM label has been added. Git tree hash: 4f80c9891f1e6a3d5e5ed26d824008a0b7ed0ad3