Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nfd-master: tweak list options for NodeFeature informer #1811

Merged

Conversation

marquiz
Copy link
Contributor

@marquiz marquiz commented Jul 25, 2024

Fix cache syncing problems on big clusters with thousands of NodeFeature objects.

On the initial list (sync) the client-go cache reflector sets the ResourceVersion to "0" (instead of leaving it empty). This causes problems in the api server with (apiserver) logs like:

E writers.go:122] apiserver was unable to write a JSON response: http:
                  Handler timeout
E status.go:71] apiserver received an error that is not an
                metav1.Status: &errors.errorString{s:"http: Handler timeout"}:
                http: Handler timeout

On the nfd-master side we see corresponding log snippets like:

W reflector.go:547] failed to list *v1alpha1.NodeFeature: stream error
                    when reading response body, may be caused by closed
                    connection. Please retry. Original error: stream
                    error: stream ID 1521; INTERNAL_ERROR; received from
                    peer
I trace.go:236] "Reflector ListAndWatch" name:*** (***) (total time:
                61126ms): ---"Objects listed" error:stream error when
                reading response body, may be caused by closed
                connection. Please retry. Original error: stream
                error: stream ID 1521; INTERNAL_ERROR; received from
                peer 61126ms (***)

Decreasing the page size (opts.Limits) does not have any effect on the timeouts. However, setting ResourceVersion to an empty value seems to get the paging on its tracks, eliminating the timeouts.

TODO: investigate in Kubernetes upstream the root cause of the timeouts with ResourceVersion="0".

Fix cache syncing problems on big clusters with thousands of NodeFeature
objects.

On the initial list (sync) the client-go cache reflector sets the
ResourceVersion to "0" (instead of leaving it empty). This causes
problems in the api server with (apiserver) logs like:

E writers.go:122] apiserver was unable to write a JSON response: http:
                  Handler timeout
E status.go:71] apiserver received an error that is not an
                metav1.Status: &errors.errorString{s:"http: Handler timeout"}:
                http: Handler timeout

On the nfd-master side we see corresponding log snippets like:

W reflector.go:547] failed to list *v1alpha1.NodeFeature: stream error
                    when reading response body, may be caused by closed
                    connection. Please retry. Original error: stream
                    error: stream ID 1521; INTERNAL_ERROR; received from
                    peer
I trace.go:236] "Reflector ListAndWatch" name:*** (***) (total time:
                61126ms): ---"Objects listed" error:stream error when
                reading response body, may be caused by closed
                connection. Please retry. Original error: stream
                error: stream ID 1521; INTERNAL_ERROR; received from
                peer 61126ms (***)

Decreasing the page size (opts.Limits) does not have any effect on the
timeouts. However, setting ResourceVersion to an empty value seems to
get the paging on its tracks, eliminating the timeouts.

TODO: investigate in Kubernetes upstream the root cause of the timeouts
with ResourceVersion="0".
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 25, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: marquiz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jul 25, 2024
Copy link

netlify bot commented Jul 25, 2024

Deploy Preview for kubernetes-sigs-nfd ready!

Name Link
🔨 Latest commit a2068f7
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-nfd/deploys/66a256bc551f08000854ee34
😎 Deploy Preview https://deploy-preview-1811--kubernetes-sigs-nfd.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@marquiz
Copy link
Contributor Author

marquiz commented Jul 25, 2024

/assign @ArangoGutierrez

/cc @ahmetb @lxlxok

@k8s-ci-robot
Copy link
Contributor

@marquiz: GitHub didn't allow me to request PR reviews from the following users: lxlxok.

Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/assign @ArangoGutierrez

/cc @ahmetb @lxlxok

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

// Tweak list opts on initial sync to avoid timeouts on the apiserver.
// NodeFeature objects are huge and the Kubernetes apiserver
// (v1.30) experiences http handler timeouts when the resource
// version is set to some non-empty value (TODO: find out why).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't happen as we basically only set rv="" on the initial LIST operation. When setting up the watches RV is always set >0

opts.ResourceVersion = ""
}
}
featureInformer := nfdinformersv1alpha1.New(informerFactory, "", tweakListOpts).NodeFeatures()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: doesn't nfd-gc need this tweak too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, needed there! I will update

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ahmetb for the note. #1815 is the outcome of this

Copy link
Contributor

@PiotrProkop PiotrProkop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 30, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 4f80c9891f1e6a3d5e5ed26d824008a0b7ed0ad3

@k8s-ci-robot k8s-ci-robot merged commit 2d24a4b into kubernetes-sigs:master Jul 30, 2024
10 checks passed
@marquiz marquiz deleted the devel/informer-listopts branch July 30, 2024 12:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants