[v0.16.0] All node labels removed while informer cache failed to sync #1802

Open
ahmetb opened this issue Jul 19, 2024 · 9 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

@ahmetb
Member

ahmetb commented Jul 19, 2024

What happened:

While migrating from v0.10 to v0.16.0:

  • all node feature labels got removed
  • kubectl get nodefeatures -n node-feature-discovery was unresponsive at the time (likely because the cluster has 4,000 nodes and the NodeFeature CR objects are ~130 kB each by default)

What you expected to happen:

How to reproduce it (as minimally and precisely as possible): Deploy the NFD chart with default values on a 4,000-node cluster.

Anything else we need to know?:

There were extensive informer sync errors in the nfd-master logs (apparently timing out after 60s). This is likely because listing NodeFeatures is a very expensive call (each object is very large, and there are a lot of Nodes in the cluster).
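To put numbers on that: ~4,000 NodeFeature objects at ~130 kB each is on the order of 500 MB in a single LIST response. A minimal sketch of what the informer's initial LIST amounts to, using the client-go dynamic client (the GroupVersionResource coordinates are an assumption about NFD's CRD, and the timeout value is illustrative):

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Assumed CRD coordinates for NodeFeature; adjust if they differ.
	gvr := schema.GroupVersionResource{
		Group: "nfd.k8s-sigs.io", Version: "v1alpha1", Resource: "nodefeatures",
	}

	// The informer's initial sync issues one LIST like this. With ~4,000
	// objects of ~130 kB each (~500 MB total), the response can easily
	// blow past a 60s deadline.
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()
	list, err := dyn.Resource(gvr).Namespace("node-feature-discovery").List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err) // e.g. "stream error ... INTERNAL_ERROR" as in the logs
	}
	fmt.Printf("listed %d NodeFeature objects\n", len(list.Items))
}
```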

Attaching logs: nfd-master.log

My suspicion is that nfd-master somehow does not wait for the informer cache to sync (the first informer sync error occurs exactly 60s after the process starts), and it treats the lack of a response as an "empty set of labels" and clears the labels. (But I'm not familiar with the inner workings of the codebase; it's just a theory.)
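If this theory holds, the failure mode would look roughly like the sketch below. All names are hypothetical, not NFD's actual code; it only illustrates why a cache miss on an unsynced informer is indistinguishable from the object not existing:

```go
package main

import "fmt"

// NodeFeature is a stand-in for NFD's NodeFeature object.
type NodeFeature struct {
	Labels map[string]string
}

// cacheLister mimics an informer-backed lister. If the initial LIST timed
// out, the underlying store is simply empty.
type cacheLister struct {
	store map[string]*NodeFeature
}

func (l *cacheLister) get(node string) *NodeFeature {
	// nil is returned both when the cache never synced and when the
	// object truly does not exist -- the caller cannot tell them apart.
	return l.store[node]
}

func updateNodeLabels(l *cacheLister, node string) {
	if nf := l.get(node); nf == nil {
		// Suspected bug: an unsynced cache looks like an empty feature
		// set, so every previously applied label is pruned from the node.
		fmt.Printf("pruning all feature labels from %s\n", node)
		return
	}
	fmt.Printf("labels on %s left intact\n", node)
}

func main() {
	unsynced := &cacheLister{store: map[string]*NodeFeature{}}
	updateNodeLabels(unsynced, "node-1") // prunes, even though features exist in the cluster
}
```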

💡 We don't see the issue on much smaller clusters.

💡 We have not yet tried v0.16.2 (the release notes mention it fixes a node removal issue, but it's unclear what the root cause was there).

Environment:

  • Kubernetes version (use kubectl version): v1.23.17
  • Cloud provider or hardware configuration: Bare-metal
  • OS (e.g: cat /etc/os-release): Not applicable
  • Kernel (e.g. uname -a): Not applicable
  • Install tools: installed via Helm
  • Network plugin and version (if this is a network-related bug): Not applicable
  • Others: Not applicable
ahmetb added the kind/bug label Jul 19, 2024
@marquiz
Contributor

marquiz commented Jul 23, 2024

I think this happens when upgrading from NFD v0.13 or earlier. The specific reason is the NodeFeature CRD API that was introduced in v0.14 (earlier versions of NFD relied on gRPC communication between nfd-master and nfd-worker instances).

What happens (I think) is that nfd-master starts quickly (and syncs the NodeFeature cache) because there are no NodeFeature objects in the cluster yet. Because of that, it starts removing the labels from nodes (as it sees no features published for them by the nfd-worker instances). At the same time the nfd-worker instances start up and begin pushing NodeFeature objects to the apiserver -> there are a ton of them and the cache throws timeout errors. Eventually things "stabilize" and the node labels come back.

One way to "fix" this would be to upgrade the nfd-workers first, so that the NodeFeature objects are created before nfd-master kicks in. Maybe we need a migration guide from pre-v0.14?

@marquiz
Contributor

marquiz commented Jul 23, 2024

@ahmetb PTAL #1807

@lxlxok

lxlxok commented Jul 23, 2024

Hi @marquiz,

Let me provide more context about the situation:

  1. NFD 0.16.0 was deployed in the cluster with master.config.noPublish=true on 7/15.
  2. NFD 0.16.0 was redeployed in the cluster with master.config.noPublish=false, and the nfd-worker DaemonSet pod was also updated on 7/19.

The logs indicate that the nfd-master pod failed to list the NodeFeature objects due to a 1-minute request timeout, although the objects do exist in the cluster. Their large size likely contributed to this timeout.

I0719 15:02:17.382982       1 trace.go:236] Trace[1302901114]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232 (19-Jul-2024 15:01:16.878) (total time: 60504ms):
Trace[1302901114]: ---"Objects listed" error:stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 3; INTERNAL_ERROR; received from peer 60504ms (15:02:17.382)
Trace[1302901114]: [1m0.504227078s] [1m0.504227078s] END

In this situation, the nfd-master removes all the labels from nodes and does not add them back.

The nfd-master invokes the node label updates here. The labels were removed because an empty NodeFeature object was created here and passed to that function.

It seems the current logic assumes that a NodeFeature object missing from the sharedInformer cache is equivalent to the NodeFeature being missing from the cluster. This causes all the labels to be removed and not added back.
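For illustration, a guard along these lines would break that assumption by gating pruning on the informer's HasSynced() (a sketch only, not what NFD currently does; function and package names are hypothetical):

```go
package nfdutil

import (
	"k8s.io/client-go/tools/cache"
)

// safeToPrune reports whether it is safe to treat a cache miss for key as
// "the NodeFeature does not exist in the cluster".
func safeToPrune(informer cache.SharedIndexInformer, key string) bool {
	if !informer.HasSynced() {
		// The initial LIST has not completed; a miss tells us nothing
		// about the cluster, so do not prune labels yet.
		return false
	}
	_, exists, err := informer.GetStore().GetByKey(key)
	if err != nil || exists {
		return false
	}
	// Cache is synced and the object is genuinely absent.
	return true
}
```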

@marquiz
Contributor

marquiz commented Jul 24, 2024

@lxlxok I'm still not convinced that this is the issue. The nfd-master blocks until all the caches are synced (here). It will never get to the node update part if/when the informer cache sync keeps timing out.

The timeout is another issue, though. That needs to be investigated/solved. Looking at the log that @ahmetb linked, nfd-master tried syncing caches for 1.5 hours without succeeding.

@ahmetb
Member Author

ahmetb commented Jul 25, 2024

@marquiz any ideas why the logs show "timeout" for the informer sync many minutes after they say "node updated"?

I'm trying to understand whether the logs show any indication of the cache sync ever succeeding, because the program proceeds to updating nodes shortly after booting.

@lxlxok

lxlxok commented Jul 25, 2024

@lxlxok I'm still not convinced that this is the issue. The nfd-master blocks until all the caches are synced (here). It will never get to the node update part if/when the informer cache sync keeps timing out.

Hi @marquiz, it doesn't block nfd-master from running the node update part. The WaitForCacheSync function returns the res map immediately (here) after it loops over all the informers.

This also explains why the updaterPool starts immediately after m.startNfdApiController (here) at 04:29:46, while the NodeFeature cache was not populated until 04:30:01.

We tested node-feature-discovery 0.16.0 in kind. Please see the verbose log below for more detail.

I0725 04:29:46.988600       1 main.go:66] "-crd-controller is deprecated, will be removed in a future release along with the deprecated gRPC API"
I0725 04:29:46.989165       1 nfd-master.go:283] "Node Feature Discovery Master" version="v0.16.0" nodeName="kind-control-plane" namespace="node-feature-discovery"
I0725 04:29:46.989241       1 nfd-master.go:1381] "configuration file parsed" path="/etc/kubernetes/node-feature-discovery/nfd-master.conf"
I0725 04:29:46.989639       1 nfd-master.go:1429] "configuration successfully updated" configuration=<
	AutoDefaultNs: true
	DenyLabelNs: {}
	EnableTaints: false
	ExtraLabelNs: {}
	Klog: {}
	LabelWhiteList: null
	LeaderElection:
	  LeaseDuration:
	    Duration: 15000000000
	  RenewDeadline:
	    Duration: 10000000000
	  RetryPeriod:
	    Duration: 2000000000
	NfdApiParallelism: 10
	NoPublish: false
	ResourceLabels: {}
	ResyncPeriod:
	  Duration: 3600000000000
 >
I0725 04:29:46.989910       1 nfd-master.go:1513] "starting the nfd api controller"
I0725 04:29:46.989993       1 nfd-api-controller.go:70] "initializing new NFD API controller" options=<
	DisableNodeFeature: false
	DisableNodeFeatureGroup: false
	ResyncPeriod: 3600000000000
 >
I0725 04:29:46.990041       1 updater-pool.go:142] "starting the NFD master updater pool" parallelism=10
I0725 04:29:46.990191       1 reflector.go:296] Starting reflector *v1alpha1.NodeFeatureRule (1h0m0s) from pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232
I0725 04:29:46.990211       1 reflector.go:332] Listing and watching *v1alpha1.NodeFeatureRule from pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232
I0725 04:29:46.990191       1 reflector.go:296] Starting reflector *v1alpha1.NodeFeature (1h0m0s) from pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232
I0725 04:29:46.990226       1 reflector.go:332] Listing and watching *v1alpha1.NodeFeature from pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232
I0725 04:29:46.990383       1 fswatcher.go:99] "added fsnotify watch" path="/etc/kubernetes/node-feature-discovery/nfd-master.conf"
I0725 04:29:46.990395       1 fswatcher.go:99] "added fsnotify watch" path="/etc/kubernetes/node-feature-discovery"
I0725 04:29:46.990402       1 fswatcher.go:99] "added fsnotify watch" path="/etc/kubernetes"
I0725 04:29:46.990407       1 fswatcher.go:99] "added fsnotify watch" path="/etc"
I0725 04:29:46.990411       1 fswatcher.go:99] "added fsnotify watch" path="/"
I0725 04:29:46.990422       1 reflector.go:296] Starting reflector *v1alpha1.NodeFeatureGroup (1h0m0s) from pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232
I0725 04:29:46.990430       1 reflector.go:332] Listing and watching *v1alpha1.NodeFeatureGroup from pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232
I0725 04:29:46.994203       1 reflector.go:359] Caches populated for *v1alpha1.NodeFeatureRule from pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232
I0725 04:29:46.994203       1 reflector.go:359] Caches populated for *v1alpha1.NodeFeatureGroup from pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232
I0725 04:29:46.994508       1 metrics.go:44] "metrics server starting" port=":8081"
I0725 04:29:46.994571       1 component.go:36] [core][Server #1]Server created
I0725 04:29:46.994589       1 nfd-master.go:417] "gRPC health server serving" port=8082
I0725 04:29:46.994633       1 component.go:36] [core][Server #1 ListenSocket #2]ListenSocket created
I0725 04:29:47.995621       1 nfd-master.go:792] "will process all nodes in the cluster"
I0725 04:29:47.998358       1 nfd-master.go:1209] "processed NodeFeatureRule objects" nodeName="kind-control-plane" objectCount=0 duration="83ns"
I0725 04:29:47.998378       1 nfd-master.go:1285] "no updates to node" nodeName="kind-control-plane"
I0725 04:29:47.998799       1 nfd-master.go:1209] "processed NodeFeatureRule objects" nodeName="kind-worker3" objectCount=0 duration="84ns"
I0725 04:29:47.998811       1 nfd-master.go:1285] "no updates to node" nodeName="kind-worker3"
I0725 04:29:47.998883       1 nfd-master.go:1209] "processed NodeFeatureRule objects" nodeName="kind-worker2" objectCount=0 duration="83ns"
I0725 04:29:47.998889       1 nfd-master.go:1209] "processed NodeFeatureRule objects" nodeName="kind-worker" objectCount=0 duration="83ns"
I0725 04:29:47.998895       1 nfd-master.go:1285] "no updates to node" nodeName="kind-worker2"
I0725 04:29:47.998896       1 nfd-master.go:1285] "no updates to node" nodeName="kind-worker"
I0725 04:29:56.762856       1 component.go:36] [transport][server-transport 0x400046b980] Closing: EOF
I0725 04:29:56.762900       1 component.go:36] [transport][server-transport 0x400046b980] loopyWriter exiting with error: transport closed by client
I0725 04:29:56.763042       1 component.go:36] [transport][server-transport 0x4000924600] Closing: EOF
I0725 04:29:56.763072       1 component.go:36] [transport][server-transport 0x4000924600] loopyWriter exiting with error: transport closed by client
I0725 04:30:01.288935       1 trace.go:236] Trace[1887133826]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232 (25-Jul-2024 04:29:46.990) (total time: 14298ms):
Trace[1887133826]: ---"Objects listed" error:<nil> 14296ms (04:30:01.286)
Trace[1887133826]: ---"Resource version extracted" 0ms (04:30:01.286)
Trace[1887133826]: ---"Objects extracted" 0ms (04:30:01.286)
Trace[1887133826]: ---"SyncWith done" 1ms (04:30:01.288)
Trace[1887133826]: ---"Resource version updated" 0ms (04:30:01.288)
Trace[1887133826]: [14.298920589s] [14.298920589s] END
I0725 04:30:01.288971       1 reflector.go:359] Caches populated for *v1alpha1.NodeFeature from pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232
I0725 04:30:01.289381       1 nfd-api-controller.go:80] "NodeFeature added" nodefeature="node-feature-discovery/kind-worker"
I0725 04:30:01.289398       1 nfd-api-controller.go:80] "NodeFeature added" nodefeature="node-feature-discovery/kind-worker2"
I0725 04:30:01.289401       1 nfd-api-controller.go:80] "NodeFeature added" nodefeature="node-feature-discovery/kind-worker3"

@marquiz
Contributor

marquiz commented Jul 25, 2024

It doesn't block nfd-master from running the node update part. The WaitForCacheSync function returns the res map immediately (here) after it loops over all the informers.

That DOES block until the caches have synced. However, in v0.16.0 we don't call WaitForCacheSync().
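For reference, the standard client-go startup sequence blocks roughly like this (core Node informers shown as a stand-in for NFD's generated NodeFeature informers; the error handling is illustrative):

```go
package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/klog/v2"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatal(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	stopCh := make(chan struct{})
	factory := informers.NewSharedInformerFactory(clientset, time.Hour)
	factory.Core().V1().Nodes().Informer() // register at least one informer
	factory.Start(stopCh)

	// This call blocks until every started informer has completed its
	// initial LIST (or stopCh closes), then reports success per type.
	for typ, synced := range factory.WaitForCacheSync(stopCh) {
		if !synced {
			klog.Fatalf("cache for %v failed to sync", typ)
		}
	}
	// Only now is it safe to interpret cache misses as real deletions.
}
```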

We tested node-feature-discovery 0.16.0 in kind. Please see the verbose log below for more detail.

You won't see this kind of log with the latest version (v0.16.3). However, there are other issues you are likely to face. I submitted #1810 and #1811 to mitigate those.

@ahmetb
Member Author

ahmetb commented Jul 25, 2024

That DOES block until the caches have synced. However, in v0.16.0 we don't call WaitForCacheSync().

@marquiz thanks for the responses. Would the lack of WaitForCacheSync() combined with something like https://github.com/kubernetes-sigs/node-feature-discovery/blob/v0.16.0/pkg/nfd-master/nfd-master.go#L821-L824 cause node labels to be dropped? (This last snippet is still there in the latest version.)

I see v0.16.3 actually does have a WaitForCacheSync(), maybe that would've made the difference.

By the way, nfd-gc also does not check the return value here:

n.factory.WaitForCacheSync(n.stopChan)
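A possible shape of the fix, reusing n.factory and n.stopChan from the quoted line (a sketch only; it assumes fmt is imported and the enclosing function returns an error, and the eventual PR may differ):

```go
// WaitForCacheSync returns false for any informer that had not synced by
// the time stopChan closed, so ignoring the result can let the GC run
// against a partially populated cache.
for typ, synced := range n.factory.WaitForCacheSync(n.stopChan) {
	if !synced {
		return fmt.Errorf("informer cache for %v failed to sync", typ)
	}
}
```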

@marquiz
Contributor

marquiz commented Jul 26, 2024

Would the lack of WaitForCacheSync() combined with something like https://github.com/kubernetes-sigs/node-feature-discovery/blob/v0.16.0/pkg/nfd-master/nfd-master.go#L821-L824 cause node labels to be dropped? (This last snippet is still there in the latest version.)

Yes, exactly that.

I see v0.16.3 actually does have a WaitForCacheSync(), maybe that would've made the difference.

It should've. But I suspect that in a BIG cluster with v0.16.3 you'll hit the "timeout problem" (that #1811 addresses) and WaitForCacheSync would block indefinitely.

By the way nfd-gc also does not check for the return value here:

Thanks for pointing that out. It's not critical there, but I think it's good to be consistent and start from a known/synced state. I'll submit a PR to fix that one.
