[v0.16.0] All node labels removed while informer cache failed to sync #1802

Open
ahmetb opened this issue Jul 19, 2024 · 9 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

@ahmetb
Member

ahmetb commented Jul 19, 2024

What happened:

While migrating from v0.10 to v0.16.0:

  • all node feature labels got removed
  • kubectl get nodefeatures -n node-feature-discovery was unresponsive at the time (likely because the cluster has 4,000 nodes and the NodeFeature CR objects are ~130 kB each by default)

What you expected to happen:

How to reproduce it (as minimally and precisely as possible): Deploy the NFD chart with default values on a 4,000-node cluster.

Anything else we need to know?:

There were extensive informer sync errors in the nfd-master logs (apparently timing out after 60s). This is likely because listing NodeFeatures is a very expensive call (each object is very large, and there are a lot of Nodes in the cluster).
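To put numbers on that: ~4,000 NodeFeature objects at ~130 kB each is on the order of 500 MB in a single LIST response. A minimal sketch of what the informer's initial LIST amounts to, using the client-go dynamic client (the GroupVersionResource coordinates are an assumption about NFD's CRD, and the timeout value is illustrative):

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Assumed CRD coordinates for NodeFeature; adjust if they differ.
	gvr := schema.GroupVersionResource{
		Group: "nfd.k8s-sigs.io", Version: "v1alpha1", Resource: "nodefeatures",
	}

	// The informer's initial sync issues one LIST like this. With ~4,000
	// objects of ~130 kB each (~500 MB total), the response can easily
	// blow past a 60s deadline.
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()
	list, err := dyn.Resource(gvr).Namespace("node-feature-discovery").List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err) // e.g. "stream error ... INTERNAL_ERROR" as in the logs
	}
	fmt.Printf("listed %d NodeFeature objects\n", len(list.Items))
}
```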

Attaching logs: nfd-master.log

My suspicion is that nfd-master somehow does not wait for the informer cache to sync (the first informer sync error occurs exactly 60s after the process starts), and it treats the lack of a response as an "empty set of labels" and clears the labels. (But I'm not familiar with the inner workings of the codebase; it's just a theory.)
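If this theory holds, the failure mode would look roughly like the sketch below. All names are hypothetical, not NFD's actual code; it only illustrates why a cache miss on an unsynced informer is indistinguishable from the object not existing:

```go
package main

import "fmt"

// NodeFeature is a stand-in for NFD's NodeFeature object.
type NodeFeature struct {
	Labels map[string]string
}

// cacheLister mimics an informer-backed lister. If the initial LIST timed
// out, the underlying store is simply empty.
type cacheLister struct {
	store map[string]*NodeFeature
}

func (l *cacheLister) get(node string) *NodeFeature {
	// nil is returned both when the cache never synced and when the
	// object truly does not exist -- the caller cannot tell them apart.
	return l.store[node]
}

func updateNodeLabels(l *cacheLister, node string) {
	if nf := l.get(node); nf == nil {
		// Suspected bug: an unsynced cache looks like an empty feature
		// set, so every previously applied label is pruned from the node.
		fmt.Printf("pruning all feature labels from %s\n", node)
		return
	}
	fmt.Printf("labels on %s left intact\n", node)
}

func main() {
	unsynced := &cacheLister{store: map[string]*NodeFeature{}}
	updateNodeLabels(unsynced, "node-1") // prunes, even though features exist in the cluster
}
```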

💡 We don't see the issue on much smaller clusters.

💡 We have not yet tried v0.16.2 (the release notes mention it fixes a node removal issue, but it's unclear what the root cause was there).

Environment:

  • Kubernetes version (use kubectl version): v1.23.17
  • Cloud provider or hardware configuration: Bare-metal
  • OS (e.g: cat /etc/os-release): Not applicable
  • Kernel (e.g. uname -a): Not applicable
  • Install tools: installed via Helm
  • Network plugin and version (if this is a network-related bug): Not applicable
  • Others: Not applicable
ahmetb added the kind/bug label Jul 19, 2024
@marquiz
Contributor

marquiz commented Jul 23, 2024

I think this happens when upgrading from NFD v0.13 or earlier. The specific reason is the NodeFeature CRD API that was introduced in v0.14 (earlier versions of NFD relied on gRPC communication between nfd-master and nfd-worker instances).

What happens (I think) is that nfd-master starts quickly (and syncs the NodeFeature cache) because there are no NodeFeature objects in the cluster yet. Because of that, it starts removing the labels from nodes (as it sees no features published for them by the nfd-worker instances). At the same time the nfd-worker instances start up and begin pushing NodeFeature objects to the apiserver -> there are a ton of them and the cache throws timeout errors. Eventually things "stabilize" and the node labels come back.

One way to "fix" this would be to upgrade the nfd-workers first, so that the NodeFeature objects are created before nfd-master kicks in. Maybe we need a migration guide from pre-v0.14?

@marquiz
Contributor

marquiz commented Jul 23, 2024

@ahmetb PTAL #1807

@lxlxok

lxlxok commented Jul 23, 2024

Hi @marquiz,

Let me provide more context about the situation:

  1. NFD 0.16.0 was deployed in the cluster with master.config.noPublish=true on 7/15.
  2. NFD 0.16.0 was redeployed in the cluster with master.config.noPublish=false, and the nfd-worker DaemonSet pod was also updated on 7/19.

The logs indicate that the nfd-master pod failed to list the NodeFeature objects due to a 1-minute request timeout, although the objects do exist in the cluster. Their large size likely contributed to this timeout.

I0719 15:02:17.382982       1 trace.go:236] Trace[1302901114]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232 (19-Jul-2024 15:01:16.878) (total time: 60504ms):
Trace[1302901114]: ---"Objects listed" error:stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 3; INTERNAL_ERROR; received from peer 60504ms (15:02:17.382)
Trace[1302901114]: [1m0.504227078s] [1m0.504227078s] END

In this situation, the nfd-master removes all the labels from nodes and does not add them back.

The nfd-master invokes the node label updates here. The labels were removed because an empty NodeFeature object was created here and passed to that function.

It seems the current logic assumes that a NodeFeature object missing from the sharedInformer cache is equivalent to the NodeFeature being missing from the cluster. This causes all the labels to be removed and not added back.
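For illustration, a guard along these lines would break that assumption by gating pruning on the informer's HasSynced() (a sketch only, not what NFD currently does; function and package names are hypothetical):

```go
package nfdutil

import (
	"k8s.io/client-go/tools/cache"
)

// safeToPrune reports whether it is safe to treat a cache miss for key as
// "the NodeFeature does not exist in the cluster".
func safeToPrune(informer cache.SharedIndexInformer, key string) bool {
	if !informer.HasSynced() {
		// The initial LIST has not completed; a miss tells us nothing
		// about the cluster, so do not prune labels yet.
		return false
	}
	_, exists, err := informer.GetStore().GetByKey(key)
	if err != nil || exists {
		return false
	}
	// Cache is synced and the object is genuinely absent.
	return true
}
```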

@marquiz
Contributor

marquiz commented Jul 24, 2024

@lxlxok I'm still not convinced that this is the issue. The nfd-master blocks until all the caches are synced (here). It will never get to the node update part if/when the informer cache sync keeps timing out.

The timeout is another issue, though. That needs to be investigated/solved. Looking at the log that @ahmetb linked, nfd-master tried syncing caches for 1.5 hours without succeeding.

@ahmetb
Member Author

ahmetb commented Jul 25, 2024

@marquiz any ideas why the logs show "timeout" for the informer sync many minutes after they say "node updated"?

I'm trying to understand whether the logs show any indication of the cache sync ever succeeding, because the program proceeds to updating nodes shortly after booting.

@lxlxok

lxlxok commented Jul 25, 2024

@lxlxok I'm still not convinced that this is the issue. The nfd-master blocks until all the caches are synced (here). It will never get to the node update part if/when the informer cache sync keeps timing out.

Hi @marquiz, it doesn't block nfd-master from running the node update part. The WaitForCacheSync function returns the res map immediately (here) after it loops over all the informers.

This also explains why the updaterPool starts immediately after m.startNfdApiController (here) at 04:29:46, while the NodeFeature cache was not populated until 04:30:01.

We tested node-feature-discovery 0.16.0 in kind. Please see the verbose log below for more detail.

I0725 04:29:46.988600       1 main.go:66] "-crd-controller is deprecated, will be removed in a future release along with the deprecated gRPC API"
I0725 04:29:46.989165       1 nfd-master.go:283] "Node Feature Discovery Master" version="v0.16.0" nodeName="kind-control-plane" namespace="node-feature-discovery"
I0725 04:29:46.989241       1 nfd-master.go:1381] "configuration file parsed" path="/etc/kubernetes/node-feature-discovery/nfd-master.conf"
I0725 04:29:46.989639       1 nfd-master.go:1429] "configuration successfully updated" configuration=<
	AutoDefaultNs: true
	DenyLabelNs: {}
	EnableTaints: false
	ExtraLabelNs: {}
	Klog: {}
	LabelWhiteList: null
	LeaderElection:
	  LeaseDuration:
	    Duration: 15000000000
	  RenewDeadline:
	    Duration: 10000000000
	  RetryPeriod:
	    Duration: 2000000000
	NfdApiParallelism: 10
	NoPublish: false
	ResourceLabels: {}
	ResyncPeriod:
	  Duration: 3600000000000
 >
I0725 04:29:46.989910       1 nfd-master.go:1513] "starting the nfd api controller"
I0725 04:29:46.989993       1 nfd-api-controller.go:70] "initializing new NFD API controller" options=<
	DisableNodeFeature: false
	DisableNodeFeatureGroup: false
	ResyncPeriod: 3600000000000
 >
I0725 04:29:46.990041       1 updater-pool.go:142] "starting the NFD master updater pool" parallelism=10
I0725 04:29:46.990191       1 reflector.go:296] Starting reflector *v1alpha1.NodeFeatureRule (1h0m0s) from pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232
I0725 04:29:46.990211       1 reflector.go:332] Listing and watching *v1alpha1.NodeFeatureRule from pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232
I0725 04:29:46.990191       1 reflector.go:296] Starting reflector *v1alpha1.NodeFeature (1h0m0s) from pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232
I0725 04:29:46.990226       1 reflector.go:332] Listing and watching *v1alpha1.NodeFeature from pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232
I0725 04:29:46.990383       1 fswatcher.go:99] "added fsnotify watch" path="/etc/kubernetes/node-feature-discovery/nfd-master.conf"
I0725 04:29:46.990395       1 fswatcher.go:99] "added fsnotify watch" path="/etc/kubernetes/node-feature-discovery"
I0725 04:29:46.990402       1 fswatcher.go:99] "added fsnotify watch" path="/etc/kubernetes"
I0725 04:29:46.990407       1 fswatcher.go:99] "added fsnotify watch" path="/etc"
I0725 04:29:46.990411       1 fswatcher.go:99] "added fsnotify watch" path="/"
I0725 04:29:46.990422       1 reflector.go:296] Starting reflector *v1alpha1.NodeFeatureGroup (1h0m0s) from pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232
I0725 04:29:46.990430       1 reflector.go:332] Listing and watching *v1alpha1.NodeFeatureGroup from pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232
I0725 04:29:46.994203       1 reflector.go:359] Caches populated for *v1alpha1.NodeFeatureRule from pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232
I0725 04:29:46.994203       1 reflector.go:359] Caches populated for *v1alpha1.NodeFeatureGroup from pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232
I0725 04:29:46.994508       1 metrics.go:44] "metrics server starting" port=":8081"
I0725 04:29:46.994571       1 component.go:36] [core][Server #1]Server created
I0725 04:29:46.994589       1 nfd-master.go:417] "gRPC health server serving" port=8082
I0725 04:29:46.994633       1 component.go:36] [core][Server #1 ListenSocket #2]ListenSocket created
I0725 04:29:47.995621       1 nfd-master.go:792] "will process all nodes in the cluster"
I0725 04:29:47.998358       1 nfd-master.go:1209] "processed NodeFeatureRule objects" nodeName="kind-control-plane" objectCount=0 duration="83ns"
I0725 04:29:47.998378       1 nfd-master.go:1285] "no updates to node" nodeName="kind-control-plane"
I0725 04:29:47.998799       1 nfd-master.go:1209] "processed NodeFeatureRule objects" nodeName="kind-worker3" objectCount=0 duration="84ns"
I0725 04:29:47.998811       1 nfd-master.go:1285] "no updates to node" nodeName="kind-worker3"
I0725 04:29:47.998883       1 nfd-master.go:1209] "processed NodeFeatureRule objects" nodeName="kind-worker2" objectCount=0 duration="83ns"
I0725 04:29:47.998889       1 nfd-master.go:1209] "processed NodeFeatureRule objects" nodeName="kind-worker" objectCount=0 duration="83ns"
I0725 04:29:47.998895       1 nfd-master.go:1285] "no updates to node" nodeName="kind-worker2"
I0725 04:29:47.998896       1 nfd-master.go:1285] "no updates to node" nodeName="kind-worker"
I0725 04:29:56.762856       1 component.go:36] [transport][server-transport 0x400046b980] Closing: EOF
I0725 04:29:56.762900       1 component.go:36] [transport][server-transport 0x400046b980] loopyWriter exiting with error: transport closed by client
I0725 04:29:56.763042       1 component.go:36] [transport][server-transport 0x4000924600] Closing: EOF
I0725 04:29:56.763072       1 component.go:36] [transport][server-transport 0x4000924600] loopyWriter exiting with error: transport closed by client
I0725 04:30:01.288935       1 trace.go:236] Trace[1887133826]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232 (25-Jul-2024 04:29:46.990) (total time: 14298ms):
Trace[1887133826]: ---"Objects listed" error:<nil> 14296ms (04:30:01.286)
Trace[1887133826]: ---"Resource version extracted" 0ms (04:30:01.286)
Trace[1887133826]: ---"Objects extracted" 0ms (04:30:01.286)
Trace[1887133826]: ---"SyncWith done" 1ms (04:30:01.288)
Trace[1887133826]: ---"Resource version updated" 0ms (04:30:01.288)
Trace[1887133826]: [14.298920589s] [14.298920589s] END
I0725 04:30:01.288971       1 reflector.go:359] Caches populated for *v1alpha1.NodeFeature from pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232
I0725 04:30:01.289381       1 nfd-api-controller.go:80] "NodeFeature added" nodefeature="node-feature-discovery/kind-worker"
I0725 04:30:01.289398       1 nfd-api-controller.go:80] "NodeFeature added" nodefeature="node-feature-discovery/kind-worker2"
I0725 04:30:01.289401       1 nfd-api-controller.go:80] "NodeFeature added" nodefeature="node-feature-discovery/kind-worker3"

@marquiz
Contributor

marquiz commented Jul 25, 2024

It doesn't block nfd-master from running the node update part. The WaitForCacheSync function returns the res map immediately (here) after it loops over all the informers.

That DOES block until the caches have synced. However, in v0.16.0 we don't call WaitForCacheSync().
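For reference, the standard client-go startup sequence blocks roughly like this (core Node informers shown as a stand-in for NFD's generated NodeFeature informers; the error handling is illustrative):

```go
package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/klog/v2"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatal(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	stopCh := make(chan struct{})
	factory := informers.NewSharedInformerFactory(clientset, time.Hour)
	factory.Core().V1().Nodes().Informer() // register at least one informer
	factory.Start(stopCh)

	// This call blocks until every started informer has completed its
	// initial LIST (or stopCh closes), then reports success per type.
	for typ, synced := range factory.WaitForCacheSync(stopCh) {
		if !synced {
			klog.Fatalf("cache for %v failed to sync", typ)
		}
	}
	// Only now is it safe to interpret cache misses as real deletions.
}
```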

We tested node-feature-discovery 0.16.0 in kind. Please see the verbose log below for more detail.

You won't see this kind of log with the latest version (v0.16.3). However, there are other issues you are likely to face. I submitted #1810 and #1811 to mitigate those.

@ahmetb
Member Author

ahmetb commented Jul 25, 2024

That DOES block until the caches have synced. However, in v0.16.0 we don't call WaitForCacheSync().

@marquiz thanks for the responses. Would the lack of WaitForCacheSync() combined with something like https://github.com/kubernetes-sigs/node-feature-discovery/blob/v0.16.0/pkg/nfd-master/nfd-master.go#L821-L824 cause node labels to be dropped? (This last snippet is still there in the latest version.)

I see v0.16.3 actually does have a WaitForCacheSync(), maybe that would've made the difference.

By the way, nfd-gc also does not check the return value here:

n.factory.WaitForCacheSync(n.stopChan)
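A possible shape of the fix, reusing n.factory and n.stopChan from the quoted line (a sketch only; it assumes fmt is imported and the enclosing function returns an error, and the eventual PR may differ):

```go
// WaitForCacheSync returns false for any informer that had not synced by
// the time stopChan closed, so ignoring the result can let the GC run
// against a partially populated cache.
for typ, synced := range n.factory.WaitForCacheSync(n.stopChan) {
	if !synced {
		return fmt.Errorf("informer cache for %v failed to sync", typ)
	}
}
```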

@marquiz
Contributor

marquiz commented Jul 26, 2024

Would the lack of WaitForCacheSync() combined with something like https://github.com/kubernetes-sigs/node-feature-discovery/blob/v0.16.0/pkg/nfd-master/nfd-master.go#L821-L824 cause node labels to be dropped? (This last snippet is still there in the latest version.)

Yes, exactly that.

I see v0.16.3 actually does have a WaitForCacheSync(), maybe that would've made the difference.

It should've. But I suspect that in a BIG cluster with v0.16.3 you'll hit the "timeout problem" (that #1811 addresses) and WaitForCacheSync would block indefinitely.

By the way nfd-gc also does not check for the return value here:

Thanks for pointing that out. It's not critical there, but I think it's good to be consistent and start from a known/synced state. I'll submit a PR to fix that one.
