
NFD Master memory leak #1841

Closed
subnet-dev opened this issue Aug 18, 2024 · 4 comments · Fixed by #1848
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

subnet-dev commented Aug 18, 2024

What happened:
We use Node Feature Discovery 0.16.4 on our GPU cluster alongside NVIDIA's gpu-operator. The GPU operator is installed with its NFD disabled so that the two tools can be managed independently. The NFD master continuously consumes more memory and never releases it: memory usage of the leader master grows by around 1.2 Gi/day in a very linear fashion. I've set a memory limit of 4Gi to prevent the pod from using more memory. The configuration file nfd-master.conf is empty (nfd-master.conf: 'null').
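For context, the gpu-operator side was installed roughly like this (illustrative command only; the nvidia repo alias and chart version are assumptions, and nfd.enabled=false is what I understand disables the operator's bundled NFD):

# Illustrative: install NVIDIA gpu-operator with its bundled NFD disabled,
# so node-feature-discovery can be managed independently.
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --version v23.9.2 \
  --set nfd.enabled=false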

This screenshot was taken before rolling out the deployment:
[Screenshot: nfd-master memory usage before the rollout, 2024-08-18 15:50]

This screenshot corresponds to the last pod start-up:
[Screenshot: nfd-master memory usage since the last pod start-up, 2024-08-18 15:55]

Here are the master logs since start-up:

I0817 19:08:35.561209       1 main.go:66] "-crd-controller is deprecated, will be removed in a future release along with the deprecated gRPC API"
I0817 19:08:35.562048       1 nfd-master.go:283] "Node Feature Discovery Master" version="v0.16.4" nodeName="admin007.xxx" namespace="node-feature-discovery"
I0817 19:08:35.562145       1 nfd-master.go:1381] "configuration file parsed" path="/etc/kubernetes/node-feature-discovery/nfd-master.conf"
I0817 19:08:35.562459       1 nfd-master.go:1429] "configuration successfully updated" configuration=<
	AutoDefaultNs: true
	DenyLabelNs: {}
	EnableTaints: false
	ExtraLabelNs: {}
	Klog: {}
	LabelWhiteList: null
	LeaderElection:
	  LeaseDuration:
	    Duration: 15000000000
	  RenewDeadline:
	    Duration: 10000000000
	  RetryPeriod:
	    Duration: 2000000000
	NfdApiParallelism: 10
	NoPublish: false
	ResourceLabels: {}
	ResyncPeriod:
	  Duration: 3600000000000
 >
I0817 19:08:35.562565       1 nfd-master.go:1513] "starting the nfd api controller"
I0817 19:08:36.363181       1 updater-pool.go:142] "starting the NFD master updater pool" parallelism=10
I0817 19:08:36.369926       1 metrics.go:44] "metrics server starting" port=":8081"
I0817 19:08:36.370063       1 leaderelection.go:250] attempting to acquire leader lease node-feature-discovery/nfd-master.nfd.kubernetes.io...
I0817 19:08:36.370299       1 component.go:36] [core][Server #1]Server created
I0817 19:08:36.370348       1 nfd-master.go:417] "gRPC health server serving" port=8082
I0817 19:08:36.370445       1 component.go:36] [core][Server #1 ListenSocket #2]ListenSocket created

What you expected to happen:
NFD Master to have a steady memory usage.

How to reproduce it (as minimally and precisely as possible):
We use Flux to deploy the Helm chart, but here is an equivalent with the same values.

export NFD_NS=node-feature-discovery
helm repo add nfd https://kubernetes-sigs.github.io/node-feature-discovery/charts
helm repo update
helm install node-feature-discovery nfd/node-feature-discovery --namespace "$NFD_NS" --create-namespace --set master.replicaCount=2 --set gc.replicaCount=2
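Since we actually deploy with Flux, the HelmRelease looks roughly like the following (a sketch with the same values; the HelmRepository name/namespace and the intervals are placeholders, and the API version may differ depending on your Flux version):

# Approximate Flux HelmRelease equivalent of the helm install above.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: node-feature-discovery
  namespace: node-feature-discovery
spec:
  interval: 10m
  chart:
    spec:
      chart: node-feature-discovery
      version: 0.16.4
      sourceRef:
        kind: HelmRepository
        name: nfd            # placeholder source name
        namespace: flux-system
  values:
    master:
      replicaCount: 2
    gc:
      replicaCount: 2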

Anything else we need to know?:
This is a new cluster installation, so I don't know whether this problem appeared with version 0.16.4 or was already present before.

Environment:

  • Kubernetes version (use kubectl version): v1.28.12
  • Cloud provider or hardware configuration:
    • HPE Apollo 6500 Gen10 Plus
    • 2 x AMD EPYC 7543 32-Core
    • 32 x 32GB DDR4
    • 8 x A100 SXM 80GB
    • Mellanox ConnectX-6 Dx 100GbE
  • OS (e.g: cat /etc/os-release): Ubuntu 22.04.4 LTS
  • Kernel (e.g. uname -a): Linux 6.5.0-45-generic #45~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Jul 15 16:40:02 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Network plugin and version (if this is a network-related bug): Cilium 1.15.7
  • Others:
    • gpu-operator v23.9.2
subnet-dev added the kind/bug label on Aug 18, 2024
marquiz (Contributor) commented Aug 19, 2024

@subnet-dev thank you for reporting the issue. Looks strange (and severe) indeed 🤔

I'm trying to reproduce the issue but with no success so far. @subnet-dev would it be possible for you to test with replicaCount=1, and possibly also with some other NFD version (e.g. v0.16.0)?

ping @ArangoGutierrez

EDIT: @subnet-dev what does the log of the other nfd-master pod look like?

subnet-dev (Author) commented

Hi @marquiz thanks for your quick reply.

I have a second test cluster, deployed with exactly the same tools and Kubernetes version, where I don't have this issue.

I will change the number of replicas to 1 for both the master and gc.
I can try rolling back to v0.16.3 or v0.16.2, but not to an older version, because the gpu-operator restarts the containers containing the NVIDIA drivers when the labels are removed (a bug fixed in v0.16.2).
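For reference, something like the following is what I plan to run (assuming master.replicaCount, gc.replicaCount and image.tag are the relevant chart values for this):

# Scale master and gc down to one replica, optionally pinning an older version.
helm upgrade node-feature-discovery nfd/node-feature-discovery \
  --namespace node-feature-discovery \
  --set master.replicaCount=1 \
  --set gc.replicaCount=1 \
  --set image.tag=v0.16.3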

Here are the logs for the second master:

I0812 12:51:08.875704       1 main.go:66] "-crd-controller is deprecated, will be removed in a future release along with the deprecated gRPC API"
I0812 12:51:08.876388       1 nfd-master.go:283] "Node Feature Discovery Master" version="v0.16.4" nodeName="admin006.xxx" namespace="node-feature-discovery"
I0812 12:51:08.876461       1 nfd-master.go:1381] "configuration file parsed" path="/etc/kubernetes/node-feature-discovery/nfd-master.conf"
I0812 12:51:08.876760       1 nfd-master.go:1429] "configuration successfully updated" configuration=<
	AutoDefaultNs: true
	DenyLabelNs: {}
	EnableTaints: false
	ExtraLabelNs: {}
	Klog: {}
	LabelWhiteList: null
	LeaderElection:
	  LeaseDuration:
	    Duration: 15000000000
	  RenewDeadline:
	    Duration: 10000000000
	  RetryPeriod:
	    Duration: 2000000000
	NfdApiParallelism: 10
	NoPublish: false
	ResourceLabels: {}
	ResyncPeriod:
	  Duration: 3600000000000
 >
I0812 12:51:08.876851       1 nfd-master.go:1513] "starting the nfd api controller"
I0812 12:51:09.477767       1 updater-pool.go:142] "starting the NFD master updater pool" parallelism=10
I0812 12:51:09.484095       1 metrics.go:44] "metrics server starting" port=":8081"
I0812 12:51:09.484203       1 leaderelection.go:250] attempting to acquire leader lease node-feature-discovery/nfd-master.nfd.kubernetes.io...
I0812 12:51:09.484515       1 component.go:36] [core][Server #1]Server created
I0812 12:51:09.484584       1 nfd-master.go:417] "gRPC health server serving" port=8082
I0812 12:51:09.484694       1 component.go:36] [core][Server #1 ListenSocket #2]ListenSocket created
I0812 12:51:33.511561       1 leaderelection.go:260] successfully acquired lease node-feature-discovery/nfd-master.nfd.kubernetes.io
I0812 12:51:34.512850       1 nfd-master.go:792] "will process all nodes in the cluster"
I0815 14:01:42.143896       1 nfd-master.go:1283] "node updated" nodeName="gpu014.xxx"
I0815 16:10:51.600735       1 nfd-master.go:1283] "node updated" nodeName="gpu014.xxx"
I0815 16:13:51.691317       1 nfd-master.go:1283] "node updated" nodeName="gpu014.xxx"

I also see error logs on the workers (perhaps the problem is unrelated; see the note after the logs):

E0819 08:15:32.546843       1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/operstate: not a directory" attributeName="operstate"
E0819 08:15:32.546891       1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/speed: not a directory" attributeName="speed"
I0819 08:15:32.635852       1 nfd-worker.go:664] "starting feature discovery..."
I0819 08:15:32.636224       1 nfd-worker.go:677] "feature discovery completed"
E0819 08:16:32.545173       1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/operstate: not a directory" attributeName="operstate"
E0819 08:16:32.545253       1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/speed: not a directory" attributeName="speed"
I0819 08:16:32.626492       1 nfd-worker.go:664] "starting feature discovery..."
I0819 08:16:32.626856       1 nfd-worker.go:677] "feature discovery completed"
E0819 08:17:32.546733       1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/operstate: not a directory" attributeName="operstate"
E0819 08:17:32.546791       1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/speed: not a directory" attributeName="speed"
I0819 08:17:32.632990       1 nfd-worker.go:664] "starting feature discovery..."
I0819 08:17:32.633790       1 nfd-worker.go:677] "feature discovery completed"
E0819 08:18:32.547596       1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/operstate: not a directory" attributeName="operstate"
E0819 08:18:32.547643       1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/speed: not a directory" attributeName="speed"
I0819 08:18:32.622336       1 nfd-worker.go:664] "starting feature discovery..."
I0819 08:18:32.623088       1 nfd-worker.go:677] "feature discovery completed"
E0819 08:19:32.546890       1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/operstate: not a directory" attributeName="operstate"
E0819 08:19:32.546935       1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/speed: not a directory" attributeName="speed"
I0819 08:19:32.618280       1 nfd-worker.go:664] "starting feature discovery..."
I0819 08:19:32.619050       1 nfd-worker.go:677] "feature discovery completed"
E0819 08:20:32.546244       1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/operstate: not a directory" attributeName="operstate"
E0819 08:20:32.546290       1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/speed: not a directory" attributeName="speed"
I0819 08:20:32.627921       1 nfd-worker.go:664] "starting feature discovery..."
I0819 08:20:32.628278       1 nfd-worker.go:677] "feature discovery completed"
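
For what it's worth, bonding_masters is a regular file under /sys/class/net (created by the bonding kernel module) rather than a per-interface directory, which would explain the "not a directory" errors when the worker tries to read operstate/speed beneath it:

# On a host with the bonding module loaded, bonding_masters is a plain file
# listing the bond master interfaces, so it has no operstate/speed attributes.
ls -l /sys/class/net/bonding_masters
cat /sys/class/net/bonding_masters/operstate   # fails with "Not a directory"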

Perhaps an important piece of information: the cluster is now made up of 12 servers.

marquiz (Contributor) commented Aug 19, 2024

@subnet-dev is the second cluster (without issues) also running with replicas=2?

BTW, you could also join the #node-feature-discovery Slack channel to discuss the issue (with a faster communication round-trip 😄)

marquiz (Contributor) commented Aug 19, 2024

@subnet-dev @ArangoGutierrez I have a strong suspicion that the problem is caused by a problematic interplay between the leader election, the NFD API informers, and the dynamic reconfiguration. I hope running with replicas=1 will resolve the most immediate issue. #1847 is one alternative for resolving this in future releases. I'm running some tests to see whether my hypothesis is correct...
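
One way to check this from the outside (assuming the master's metrics endpoint exposes the standard Go runtime collectors on /metrics) is to watch goroutine and heap metrics across a few config reloads or leader changes; informers that are started but never stopped should show up as a steadily growing goroutine count:

# Port-forward the nfd-master metrics port (8081 per the logs above) and sample
# the Go runtime metrics; the deployment name depends on the Helm release name.
kubectl -n node-feature-discovery port-forward deploy/node-feature-discovery-master 8081:8081 &
watch -n 60 'curl -s localhost:8081/metrics | grep -E "^go_goroutines|^go_memstats_heap_inuse_bytes"'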
