
NodeFeature resource should not be owned by Pod #1817

Open · ahmetb opened this issue Jul 26, 2024 · 2 comments · May be fixed by #1860
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

ahmetb commented Jul 26, 2024

Hello, I'm trying to understand why the NodeFeature resource is owned by DaemonSet Pods.

We've been using 0.4.0 and noticed that 0.16.x brings a ton of architectural changes. These changes seem to lead to removals in some scenarios, such as #1802, with removal through the Kubernetes garbage collector being another.

In our use case, removal of node labels at any time is unacceptable: we rely heavily on nodeSelectors for scheduling (or preventing scheduling), we program controllers to avoid touching nodes with certain labels, and so on. So the point-in-time absence of a DaemonSet Pod causing removal of a NodeFeature is a problem for us.

Is there a reason why NodeFeature is owned by the nfd-worker Pod instead of, say, the Node?

cc: @lxlxok

ahmetb added the kind/feature label on Jul 26, 2024

marquiz commented Jul 29, 2024

> Hello, I'm trying to understand why the NodeFeature resource is owned by DaemonSet Pods.

The reason is garbage collection: keeping the cluster clean of stale NodeFeature objects after uninstallation of NFD (and, for example, preventing another instance of NFD from picking them up).

> We've been using 0.4.0 and noticed that 0.16.x brings a ton of architectural changes. These changes seem to lead to removals in some scenarios, such as #1802, with removal through the Kubernetes garbage collector being another.

There were issues in v0.16.0, but those should now be mitigated in the latest v0.16.3.

> In our use case, removal of node labels at any time is unacceptable: we rely heavily on nodeSelectors for scheduling (or preventing scheduling), we program controllers to avoid touching nodes with certain labels, and so on. So the point-in-time absence of a DaemonSet Pod causing removal of a NodeFeature is a problem for us.

I can see your point. This is now mitigated so that the NodeFeature objects are owned by both the Pod and the DaemonSet; NodeFeatures are only GC'd if you uninstall the DaemonSet.
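
For illustration, here's a minimal Go sketch (apimachinery types only; the names, namespace and UIDs are placeholders) of what that dual ownership looks like on a NodeFeature's metadata:

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

func main() {
	// Owner references on a NodeFeature pointing at both the nfd-worker Pod and
	// its DaemonSet. The garbage collector only deletes an object once none of
	// its owners exist, so the NodeFeature survives Pod churn and is cascaded
	// away only when the DaemonSet itself is removed. Names/UIDs are placeholders.
	meta := metav1.ObjectMeta{
		Name:      "worker-node-1",
		Namespace: "node-feature-discovery",
		OwnerReferences: []metav1.OwnerReference{
			{APIVersion: "v1", Kind: "Pod", Name: "nfd-worker-abcde", UID: types.UID("pod-uid-placeholder")},
			{APIVersion: "apps/v1", Kind: "DaemonSet", Name: "nfd-worker", UID: types.UID("ds-uid-placeholder")},
		},
	}
	fmt.Printf("%+v\n", meta.OwnerReferences)
}
```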

What we could do is add something like a core.noOwnerRefs configuration setting to nfd-worker to prevent automatic GC for good. WDYT?
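
A hypothetical sketch of how such an opt-out could be wired into the worker; the coreConfig type, the noOwnerRefs field and the applyOwnerRefs helper don't exist today and are made up purely to illustrate the idea:

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// coreConfig is a hypothetical slice of the nfd-worker configuration; the
// NoOwnerRefs field does not exist today and only illustrates the proposal.
type coreConfig struct {
	NoOwnerRefs bool `json:"noOwnerRefs,omitempty"`
}

// applyOwnerRefs attaches owner references to a NodeFeature's metadata unless
// the (hypothetical) noOwnerRefs setting opts out of automatic garbage collection.
func applyOwnerRefs(meta *metav1.ObjectMeta, cfg coreConfig, refs []metav1.OwnerReference) {
	if cfg.NoOwnerRefs {
		return // leave the object unowned so the GC never cascades to it
	}
	meta.OwnerReferences = refs
}

func main() {
	meta := metav1.ObjectMeta{Name: "worker-node-1", Namespace: "node-feature-discovery"}
	applyOwnerRefs(&meta, coreConfig{NoOwnerRefs: true}, nil)
	fmt.Println(len(meta.OwnerReferences)) // prints 0: no owner references were set
}
```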

> Is there a reason why NodeFeature is owned by the nfd-worker Pod instead of, say, the Node?

A namespaced object (NodeFeature) cannot be "owned" (via an ownerReference) by a cluster-scoped object (Node).


ahmetb commented Aug 16, 2024

Sorry, I forgot to reply.

> A namespaced object (NodeFeature) cannot be "owned" (via an ownerReference) by a cluster-scoped object (Node).

Are you sure? I think it's the other way around. :)
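
For what it's worth, OwnerReference carries no namespace field, so a namespaced dependent can reference a cluster-scoped owner such as a Node; it's the reverse direction (a cluster-scoped dependent with a namespaced owner) and cross-namespace references that are disallowed. A minimal sketch, with a placeholder name and UID:

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

func main() {
	// An owner reference a namespaced NodeFeature could carry, pointing at the
	// cluster-scoped Node it describes. The UID would normally be copied from
	// the live Node object; here it is just a placeholder.
	nodeOwner := metav1.OwnerReference{
		APIVersion: "v1",
		Kind:       "Node",
		Name:       "worker-node-1",
		UID:        types.UID("00000000-0000-0000-0000-000000000000"),
	}
	fmt.Printf("%+v\n", nodeOwner)
}
```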

> What we could do is add something like a core.noOwnerRefs configuration setting to nfd-worker to prevent automatic GC for good. WDYT?

IMO this should be the default. Many serious companies rely on the persistence of node labels for scheduling-time decisions, as there's no other way to select a particular group of nodes (pools) in a heterogeneous cluster.
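
To make the dependency concrete, here's a sketch of a Pod spec that stops being schedulable the moment its NFD label disappears (the label key is just one example of an NFD-published feature label):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// A Pod that may only land on nodes carrying an NFD-published label. If the
	// label is removed because the backing NodeFeature was garbage-collected,
	// the scheduler no longer has any eligible nodes for Pods like this.
	spec := corev1.PodSpec{
		NodeSelector: map[string]string{
			"feature.node.kubernetes.io/cpu-cpuid.AVX512F": "true", // example feature label
		},
		Containers: []corev1.Container{
			{Name: "app", Image: "registry.example.com/app:latest"},
		},
	}
	fmt.Println(spec.NodeSelector)
}
```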

> The reason is garbage collection: keeping the cluster clean of stale NodeFeature objects after uninstallation of NFD

If I'm uninstalling NFD, I delete the CRD, which deletes the CRs of that type. I don't follow why this is a path the project needs to handle.
