
NFD Master memory leak #1841

Closed
subnet-dev opened this issue Aug 18, 2024 · 4 comments · Fixed by #1848
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

subnet-dev commented Aug 18, 2024

What happened:
We use Node Feature Discovery 0.16.4 on our GPU cluster alongside NVIDIA's gpu-operator. The GPU operator is installed with its NFD disabled so that the two tools can be managed independently. The NFD master continuously consumes more memory and never releases it: memory usage of the leader master grows by around 1.2 Gi/day in a very linear fashion. I've set a memory limit of 4Gi to prevent the pod from using more memory. The configuration file nfd-master.conf is empty (nfd-master.conf: 'null').
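For context, the gpu-operator side was installed roughly like this (illustrative command only; the nvidia repo alias and chart version are assumptions, and nfd.enabled=false is what I understand disables the operator's bundled NFD):

# Illustrative: install NVIDIA gpu-operator with its bundled NFD disabled,
# so node-feature-discovery can be managed independently.
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --version v23.9.2 \
  --set nfd.enabled=false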

This screenshot was taken before rolling out the deployment:
[Screenshot: nfd-master memory usage before the rollout, 2024-08-18 15:50]

This screenshot corresponds to the last pod start-up:
[Screenshot: nfd-master memory usage since the last pod start-up, 2024-08-18 15:55]

Here are the master logs since start-up:

I0817 19:08:35.561209       1 main.go:66] "-crd-controller is deprecated, will be removed in a future release along with the deprecated gRPC API"
I0817 19:08:35.562048       1 nfd-master.go:283] "Node Feature Discovery Master" version="v0.16.4" nodeName="admin007.xxx" namespace="node-feature-discovery"
I0817 19:08:35.562145       1 nfd-master.go:1381] "configuration file parsed" path="/etc/kubernetes/node-feature-discovery/nfd-master.conf"
I0817 19:08:35.562459       1 nfd-master.go:1429] "configuration successfully updated" configuration=<
	AutoDefaultNs: true
	DenyLabelNs: {}
	EnableTaints: false
	ExtraLabelNs: {}
	Klog: {}
	LabelWhiteList: null
	LeaderElection:
	  LeaseDuration:
	    Duration: 15000000000
	  RenewDeadline:
	    Duration: 10000000000
	  RetryPeriod:
	    Duration: 2000000000
	NfdApiParallelism: 10
	NoPublish: false
	ResourceLabels: {}
	ResyncPeriod:
	  Duration: 3600000000000
 >
I0817 19:08:35.562565       1 nfd-master.go:1513] "starting the nfd api controller"
I0817 19:08:36.363181       1 updater-pool.go:142] "starting the NFD master updater pool" parallelism=10
I0817 19:08:36.369926       1 metrics.go:44] "metrics server starting" port=":8081"
I0817 19:08:36.370063       1 leaderelection.go:250] attempting to acquire leader lease node-feature-discovery/nfd-master.nfd.kubernetes.io...
I0817 19:08:36.370299       1 component.go:36] [core][Server #1]Server created
I0817 19:08:36.370348       1 nfd-master.go:417] "gRPC health server serving" port=8082
I0817 19:08:36.370445       1 component.go:36] [core][Server #1 ListenSocket #2]ListenSocket created

What you expected to happen:
NFD Master to have a steady memory usage.

How to reproduce it (as minimally and precisely as possible):
We use Flux to deploy the Helm chart, but here is an equivalent with the same values.

export NFD_NS=node-feature-discovery
helm repo add nfd https://kubernetes-sigs.github.io/node-feature-discovery/charts
helm repo update
helm install node-feature-discovery nfd/node-feature-discovery --namespace "$NFD_NS" --create-namespace --set master.replicaCount=2 --set gc.replicaCount=2
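Since we actually deploy with Flux, the HelmRelease looks roughly like the following (a sketch with the same values; the HelmRepository name/namespace and the intervals are placeholders, and the API version may differ depending on your Flux version):

# Approximate Flux HelmRelease equivalent of the helm install above.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: node-feature-discovery
  namespace: node-feature-discovery
spec:
  interval: 10m
  chart:
    spec:
      chart: node-feature-discovery
      version: 0.16.4
      sourceRef:
        kind: HelmRepository
        name: nfd            # placeholder source name
        namespace: flux-system
  values:
    master:
      replicaCount: 2
    gc:
      replicaCount: 2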

Anything else we need to know?:
This is a new cluster installation, so I don't know whether this problem appeared with version 0.16.4 or was already present before.

Environment:

  • Kubernetes version (use kubectl version): v1.28.12
  • Cloud provider or hardware configuration:
    • HPE Apollo 6500 Gen10 Plus
    • 2 x AMD EPYC 7543 32-Core
    • 32 x 32GB DDR4
    • 8 x A100 SXM 80GB
    • Mellanox ConnectX-6 Dx 100GbE
  • OS (e.g: cat /etc/os-release): Ubuntu 22.04.4 LTS
  • Kernel (e.g. uname -a): Linux 6.5.0-45-generic #45~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Jul 15 16:40:02 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Network plugin and version (if this is a network-related bug): Cilium 1.15.7
  • Others:
    • gpu-operator v23.9.2
subnet-dev added the kind/bug label on Aug 18, 2024
marquiz (Contributor) commented Aug 19, 2024

@subnet-dev thank you for reporting the issue. Looks strange (and severe) indeed 🤔

I'm trying to reproduce the issue but with no success so far. @subnet-dev would it be possible for you to test with replicaCount=1, and possibly also with some other NFD version (e.g. v0.16.0)?

ping @ArangoGutierrez

EDIT: @subnet-dev what does the log of the other nfd-master pod look like?

subnet-dev (Author) commented

Hi @marquiz thanks for your quick reply.

I have a second test cluster, deployed with exactly the same tools and Kubernetes version, where I don't have this issue.

I will change the number of replicas to 1 for both the master and gc.
I can try rolling back to v0.16.3 or v0.16.2, but not to an older version, because the gpu-operator restarts the containers containing the NVIDIA drivers when the labels are removed (a bug fixed in v0.16.2).
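For reference, something like the following is what I plan to run (assuming master.replicaCount, gc.replicaCount and image.tag are the relevant chart values for this):

# Scale master and gc down to one replica, optionally pinning an older version.
helm upgrade node-feature-discovery nfd/node-feature-discovery \
  --namespace node-feature-discovery \
  --set master.replicaCount=1 \
  --set gc.replicaCount=1 \
  --set image.tag=v0.16.3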

Here are the logs for the second master:

I0812 12:51:08.875704       1 main.go:66] "-crd-controller is deprecated, will be removed in a future release along with the deprecated gRPC API"
I0812 12:51:08.876388       1 nfd-master.go:283] "Node Feature Discovery Master" version="v0.16.4" nodeName="admin006.xxx" namespace="node-feature-discovery"
I0812 12:51:08.876461       1 nfd-master.go:1381] "configuration file parsed" path="/etc/kubernetes/node-feature-discovery/nfd-master.conf"
I0812 12:51:08.876760       1 nfd-master.go:1429] "configuration successfully updated" configuration=<
	AutoDefaultNs: true
	DenyLabelNs: {}
	EnableTaints: false
	ExtraLabelNs: {}
	Klog: {}
	LabelWhiteList: null
	LeaderElection:
	  LeaseDuration:
	    Duration: 15000000000
	  RenewDeadline:
	    Duration: 10000000000
	  RetryPeriod:
	    Duration: 2000000000
	NfdApiParallelism: 10
	NoPublish: false
	ResourceLabels: {}
	ResyncPeriod:
	  Duration: 3600000000000
 >
I0812 12:51:08.876851       1 nfd-master.go:1513] "starting the nfd api controller"
I0812 12:51:09.477767       1 updater-pool.go:142] "starting the NFD master updater pool" parallelism=10
I0812 12:51:09.484095       1 metrics.go:44] "metrics server starting" port=":8081"
I0812 12:51:09.484203       1 leaderelection.go:250] attempting to acquire leader lease node-feature-discovery/nfd-master.nfd.kubernetes.io...
I0812 12:51:09.484515       1 component.go:36] [core][Server #1]Server created
I0812 12:51:09.484584       1 nfd-master.go:417] "gRPC health server serving" port=8082
I0812 12:51:09.484694       1 component.go:36] [core][Server #1 ListenSocket #2]ListenSocket created
I0812 12:51:33.511561       1 leaderelection.go:260] successfully acquired lease node-feature-discovery/nfd-master.nfd.kubernetes.io
I0812 12:51:34.512850       1 nfd-master.go:792] "will process all nodes in the cluster"
I0815 14:01:42.143896       1 nfd-master.go:1283] "node updated" nodeName="gpu014.xxx"
I0815 16:10:51.600735       1 nfd-master.go:1283] "node updated" nodeName="gpu014.xxx"
I0815 16:13:51.691317       1 nfd-master.go:1283] "node updated" nodeName="gpu014.xxx"

I also see error logs on the workers (perhaps the problem is unrelated; see the note after the logs):

E0819 08:15:32.546843       1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/operstate: not a directory" attributeName="operstate"
E0819 08:15:32.546891       1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/speed: not a directory" attributeName="speed"
I0819 08:15:32.635852       1 nfd-worker.go:664] "starting feature discovery..."
I0819 08:15:32.636224       1 nfd-worker.go:677] "feature discovery completed"
E0819 08:16:32.545173       1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/operstate: not a directory" attributeName="operstate"
E0819 08:16:32.545253       1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/speed: not a directory" attributeName="speed"
I0819 08:16:32.626492       1 nfd-worker.go:664] "starting feature discovery..."
I0819 08:16:32.626856       1 nfd-worker.go:677] "feature discovery completed"
E0819 08:17:32.546733       1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/operstate: not a directory" attributeName="operstate"
E0819 08:17:32.546791       1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/speed: not a directory" attributeName="speed"
I0819 08:17:32.632990       1 nfd-worker.go:664] "starting feature discovery..."
I0819 08:17:32.633790       1 nfd-worker.go:677] "feature discovery completed"
E0819 08:18:32.547596       1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/operstate: not a directory" attributeName="operstate"
E0819 08:18:32.547643       1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/speed: not a directory" attributeName="speed"
I0819 08:18:32.622336       1 nfd-worker.go:664] "starting feature discovery..."
I0819 08:18:32.623088       1 nfd-worker.go:677] "feature discovery completed"
E0819 08:19:32.546890       1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/operstate: not a directory" attributeName="operstate"
E0819 08:19:32.546935       1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/speed: not a directory" attributeName="speed"
I0819 08:19:32.618280       1 nfd-worker.go:664] "starting feature discovery..."
I0819 08:19:32.619050       1 nfd-worker.go:677] "feature discovery completed"
E0819 08:20:32.546244       1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/operstate: not a directory" attributeName="operstate"
E0819 08:20:32.546290       1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/speed: not a directory" attributeName="speed"
I0819 08:20:32.627921       1 nfd-worker.go:664] "starting feature discovery..."
I0819 08:20:32.628278       1 nfd-worker.go:677] "feature discovery completed"
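
For what it's worth, bonding_masters is a regular file under /sys/class/net (created by the bonding kernel module) rather than a per-interface directory, which would explain the "not a directory" errors when the worker tries to read operstate/speed beneath it:

# On a host with the bonding module loaded, bonding_masters is a plain file
# listing the bond master interfaces, so it has no operstate/speed attributes.
ls -l /sys/class/net/bonding_masters
cat /sys/class/net/bonding_masters/operstate   # fails with "Not a directory"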

Perhaps an important piece of information: the cluster is now made up of 12 servers.

marquiz (Contributor) commented Aug 19, 2024

@subnet-dev is the second cluster (without issues) also running with replicas=2?

BTW, you could also join the #node-feature-discovery Slack channel to discuss the issue (with a faster communication round-trip 😄)

marquiz (Contributor) commented Aug 19, 2024

@subnet-dev @ArangoGutierrez I have a strong suspicion that the problem is caused by a problematic interplay between the leader election, the NFD API informers, and the dynamic reconfiguration. I hope running with replicas=1 will resolve the most immediate issue. #1847 is one alternative for resolving this in future releases. I'm running some tests to see whether my hypothesis is correct...
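
One way to check this from the outside (assuming the master's metrics endpoint exposes the standard Go runtime collectors on /metrics) is to watch goroutine and heap metrics across a few config reloads or leader changes; informers that are started but never stopped should show up as a steadily growing goroutine count:

# Port-forward the nfd-master metrics port (8081 per the logs above) and sample
# the Go runtime metrics; the deployment name depends on the Helm release name.
kubectl -n node-feature-discovery port-forward deploy/node-feature-discovery-master 8081:8081 &
watch -n 60 'curl -s localhost:8081/metrics | grep -E "^go_goroutines|^go_memstats_heap_inuse_bytes"'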
