NFD Master memory leak #1841
@subnet-dev thank you for reporting the issue. Looks strange (and severe) indeed 🤔 I'm trying to reproduce the issue but with no success so far. @subnet-dev would it be possible for you to test with replicaCount=1, and possibly also with some other NFD version (e.g. v0.16.0)? ping @ArangoGutierrez
EDIT: @subnet-dev what does the log of the other nfd-master pod look like?
Hi @marquiz, thanks for your quick reply. I have a second test cluster, deployed with exactly the same tools and Kubernetes version, where I don't have this issue. I will change the number of replicas to 1 for the master and gc deployments. Here are the logs for the second master:
I0812 12:51:08.875704 1 main.go:66] "-crd-controller is deprecated, will be removed in a future release along with the deprecated gRPC API"
I0812 12:51:08.876388 1 nfd-master.go:283] "Node Feature Discovery Master" version="v0.16.4" nodeName="admin006.xxx" namespace="node-feature-discovery"
I0812 12:51:08.876461 1 nfd-master.go:1381] "configuration file parsed" path="/etc/kubernetes/node-feature-discovery/nfd-master.conf"
I0812 12:51:08.876760 1 nfd-master.go:1429] "configuration successfully updated" configuration=<
AutoDefaultNs: true
DenyLabelNs: {}
EnableTaints: false
ExtraLabelNs: {}
Klog: {}
LabelWhiteList: null
LeaderElection:
LeaseDuration:
Duration: 15000000000
RenewDeadline:
Duration: 10000000000
RetryPeriod:
Duration: 2000000000
NfdApiParallelism: 10
NoPublish: false
ResourceLabels: {}
ResyncPeriod:
Duration: 3600000000000
>
I0812 12:51:08.876851 1 nfd-master.go:1513] "starting the nfd api controller"
I0812 12:51:09.477767 1 updater-pool.go:142] "starting the NFD master updater pool" parallelism=10
I0812 12:51:09.484095 1 metrics.go:44] "metrics server starting" port=":8081"
I0812 12:51:09.484203 1 leaderelection.go:250] attempting to acquire leader lease node-feature-discovery/nfd-master.nfd.kubernetes.io...
I0812 12:51:09.484515 1 component.go:36] [core][Server #1]Server created
I0812 12:51:09.484584 1 nfd-master.go:417] "gRPC health server serving" port=8082
I0812 12:51:09.484694 1 component.go:36] [core][Server #1 ListenSocket #2]ListenSocket created
I0812 12:51:33.511561 1 leaderelection.go:260] successfully acquired lease node-feature-discovery/nfd-master.nfd.kubernetes.io
I0812 12:51:34.512850 1 nfd-master.go:792] "will process all nodes in the cluster"
I0815 14:01:42.143896 1 nfd-master.go:1283] "node updated" nodeName="gpu014.xxx"
I0815 16:10:51.600735 1 nfd-master.go:1283] "node updated" nodeName="gpu014.xxx"
I0815 16:13:51.691317 1 nfd-master.go:1283] "node updated" nodeName="gpu014.xxx"
I also see error logs on the workers (though perhaps the problem is unrelated):
E0819 08:15:32.546843 1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/operstate: not a directory" attributeName="operstate"
E0819 08:15:32.546891 1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/speed: not a directory" attributeName="speed"
I0819 08:15:32.635852 1 nfd-worker.go:664] "starting feature discovery..."
I0819 08:15:32.636224 1 nfd-worker.go:677] "feature discovery completed"
E0819 08:16:32.545173 1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/operstate: not a directory" attributeName="operstate"
E0819 08:16:32.545253 1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/speed: not a directory" attributeName="speed"
I0819 08:16:32.626492 1 nfd-worker.go:664] "starting feature discovery..."
I0819 08:16:32.626856 1 nfd-worker.go:677] "feature discovery completed"
E0819 08:17:32.546733 1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/operstate: not a directory" attributeName="operstate"
E0819 08:17:32.546791 1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/speed: not a directory" attributeName="speed"
I0819 08:17:32.632990 1 nfd-worker.go:664] "starting feature discovery..."
I0819 08:17:32.633790 1 nfd-worker.go:677] "feature discovery completed"
E0819 08:18:32.547596 1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/operstate: not a directory" attributeName="operstate"
E0819 08:18:32.547643 1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/speed: not a directory" attributeName="speed"
I0819 08:18:32.622336 1 nfd-worker.go:664] "starting feature discovery..."
I0819 08:18:32.623088 1 nfd-worker.go:677] "feature discovery completed"
E0819 08:19:32.546890 1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/operstate: not a directory" attributeName="operstate"
E0819 08:19:32.546935 1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/speed: not a directory" attributeName="speed"
I0819 08:19:32.618280 1 nfd-worker.go:664] "starting feature discovery..."
I0819 08:19:32.619050 1 nfd-worker.go:677] "feature discovery completed"
E0819 08:20:32.546244 1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/operstate: not a directory" attributeName="operstate"
E0819 08:20:32.546290 1 network.go:154] "failed to read net iface attribute" err="open /host-sys/class/net/bonding_masters/speed: not a directory" attributeName="speed"
I0819 08:20:32.627921 1 nfd-worker.go:664] "starting feature discovery..."
I0819 08:20:32.628278 1 nfd-worker.go:677] "feature discovery completed"
Perhaps an important piece of information: the cluster is now made up of 12 servers.
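As a side note on the worker errors above: when the bonding module is loaded, /sys/class/net contains a plain file named bonding_masters next to the per-device directories. The sketch below is an illustrative guess, not NFD's actual code; it reproduces the same "not a directory" error for any code that appends operstate or speed to every entry of that directory:

```go
// Illustrative sketch (not NFD's actual implementation) of how errors like
// "open /host-sys/class/net/bonding_masters/operstate: not a directory" can arise.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// Mounted as /host-sys/class/net inside the nfd-worker pod.
	const netClassDir = "/sys/class/net"

	entries, err := os.ReadDir(netClassDir)
	if err != nil {
		fmt.Println("cannot read", netClassDir, ":", err)
		return
	}
	for _, e := range entries {
		// Treating every entry as a device directory fails for the plain
		// file "bonding_masters" with ENOTDIR, as seen in the worker logs.
		attrPath := filepath.Join(netClassDir, e.Name(), "operstate")
		data, err := os.ReadFile(attrPath)
		if err != nil {
			fmt.Printf("skipping %s: %v\n", e.Name(), err)
			continue
		}
		fmt.Printf("%s operstate=%s", e.Name(), data)
	}
}
```

If that is what is happening here, the errors are cosmetic for the bonding_masters entry and most likely unrelated to the master-side memory growth.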
@subnet-dev is the second cluster (without issues) also running with replicas=2? BTW, you could also join the #node-feature-discovery slack channel to discuss the issue (with faster communication round-trip 😄)
@subnet-dev @ArangoGutierrez I have a strong suspicion that the problem is related to an unfortunate interplay between the leader election, the NFD API informers, and the dynamic reconfiguration. I hope running with replicas=1 resolves the most immediate issue; #1847 is one alternative for resolving this in future releases. I'm running some tests to see if my hypothesis is correct...
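To illustrate the kind of interplay suspected above, here is a hypothetical Go sketch, not NFD's actual code (the controller type and function names are invented), of how recreating client-go informers on every dynamic reconfiguration or leader-election transition can leak memory when the previous informers are never stopped:

```go
// Hypothetical sketch of an informer-lifecycle leak; not NFD's actual implementation.
package sketch

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

type controller struct {
	client kubernetes.Interface
	stopCh chan struct{} // stop channel of the informers currently running
}

// leakyReconfigure recreates the informer factory on every config reload but
// never stops the previous one: each reload leaves the old watch goroutines
// and object caches alive, so memory grows with every reconfiguration.
func (c *controller) leakyReconfigure() {
	factory := informers.NewSharedInformerFactory(c.client, time.Hour)
	factory.Core().V1().Nodes().Informer() // register a Node informer
	factory.Start(make(chan struct{}))     // this channel is never closed -> leak
}

// fixedReconfigure closes the previous stop channel first, letting the old
// informers shut down and their caches be released before new ones start.
func (c *controller) fixedReconfigure() {
	if c.stopCh != nil {
		close(c.stopCh)
	}
	c.stopCh = make(chan struct{})
	factory := informers.NewSharedInformerFactory(c.client, time.Hour)
	factory.Core().V1().Nodes().Informer()
	factory.Start(c.stopCh)
}
```

This only illustrates the general mechanism by which orphaned informers accumulate; it says nothing about what #1847 actually changes.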
What happened:
We use Node Feature Discovery 0.16.4 on our GPU cluster in addition to NVIDIA's gpu-operator. The GPU operator is installed with NFD disabled so that the two tools can be managed independently. NFD master continuously utilises more memory and never releases it: memory usage increases by around 1.2 Gi/day, in a very linear way, on the master that is the leader. I've set a memory limit of 4Gi to prevent the pod from using more memory. The configuration file nfd-master.conf is empty (nfd-master.conf: 'null').
This screenshot was taken before rolling out the deployment:
This screenshot corresponds to the last pod start-up:
Here are the master logs since start-up:
What you expected to happen:
NFD master memory usage to remain steady.
How to reproduce it (as minimally and precisely as possible):
We use Flux to deploy the Helm chart, but here is an equivalent with the same values:
export NFD_NS=node-feature-discovery
helm repo add nfd https://kubernetes-sigs.github.io/node-feature-discovery/charts
helm repo update
helm install node-feature-discovery nfd/node-feature-discovery --namespace node-feature-discovery --create-namespace --set master.replicaCount=2 --set gc.replicaCount=2
Anything else we need to know?:
This is a new cluster installation, so I don't know whether this problem appeared with version 0.16.4 or was already present before.
Environment:
Kubernetes version (kubectl version): v1.28.12
OS (cat /etc/os-release): Ubuntu 22.04.4 LTS
Kernel (uname -a): Linux 6.5.0-45-generic #45~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Jul 15 16:40:02 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
Network plugin: Cilium 1.15.7