
node-problem-detector cannot run in non-privileged mode #698

Open
ialidzhikov opened this issue Sep 1, 2022 · 15 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-kind Indicates a PR lacks a `kind/foo` label and requires one.

Comments

@ialidzhikov

/kind bug

What happened?

Running containers in privileged mode is not recommended, as privileged containers run with all Linux capabilities enabled and can access the host's resources. Running containers in privileged mode opens up a number of security threats, such as breakout to the underlying host OS.

Currently the node-problem-detector DaemonSet runs in privileged mode.

```yaml
securityContext:
  privileged: true
```

When trying to run node-problem-detector in non-privileged mode (even with all capabilities added), one of its monitors fails with:

```
E0808 06:25:33.740326       1 problem_detector.go:55] Failed to start problem daemon &{/config/kernel-monitor.json 0xc00035b7a0 0xc000443100 {{kmsg map[] /dev/kmsg 5m } 10 kernel-monitor [{KernelDeadlock  {0 0 <nil>} KernelHasNoDeadlock kernel has no deadlock} {ReadonlyFilesystem  {0 0 <nil>} FilesystemIsNotReadOnly Filesystem is not read-only}] [{temporary  OOMKilling Killed process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB.*} {temporary  TaskHung task [\S ]+:\w+ blocked for more than \w+ seconds\.} {temporary  UnregisterNetDevice unregister_netdevice: waiting for \w+ to become free. Usage count = \d+} {temporary  KernelOops BUG: unable to handle kernel NULL pointer dereference at .*} {temporary  KernelOops divide error: 0000 \[#\d+\] SMP} {temporary  Ext4Error EXT4-fs error .*} {temporary  Ext4Warning EXT4-fs warning .*} {temporary  IOError Buffer I/O error .*} {temporary  MemoryReadError CE memory read error .*} {permanent KernelDeadlock DockerHung task docker:\w+ blocked for more than \w+ seconds\.} {permanent ReadonlyFilesystem FilesystemIsReadOnly Remounting filesystem read-only}] 0xc00043d21e} [] <nil> 0xc00045aea0 0xc00044bb80}: failed to create kmsg parser: open /dev/kmsg: operation not permitted
```

I don't fully understand which permissions are required to read kernel logs from /dev/kmsg.
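For what it's worth, reading /dev/kmsg typically requires CAP_SYSLOG when kernel.dmesg_restrict=1, and inside a container the runtime's device cgroup can deny access even when the capability is granted. A minimal Go sketch (a hypothetical `probeKmsg` helper, not part of NPD) that distinguishes the failure modes:

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// probeKmsg attempts to open the kernel log ring buffer read-only and
// classifies the result. A permission error usually means a missing
// CAP_SYSLOG or a device-cgroup denial; a not-exist error means the
// device node was not mounted into the container.
func probeKmsg(path string) string {
	f, err := os.Open(path)
	switch {
	case err == nil:
		f.Close()
		return "readable"
	case errors.Is(err, fs.ErrPermission):
		return "permission denied (CAP_SYSLOG / device cgroup?)"
	case errors.Is(err, fs.ErrNotExist):
		return "not present"
	default:
		return err.Error()
	}
}

func main() {
	fmt.Println(probeKmsg("/dev/kmsg"))
}
```

Running this inside the container should show whether the failure is a permission problem (capabilities or device cgroup) or simply a missing device node (no hostPath mount).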

What did you expect to happen?

I would expect to be able to run node-problem-detector in non-privileged mode.

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Sep 1, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 30, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 30, 2022
@ialidzhikov
Author

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Dec 30, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 30, 2023
@ialidzhikov
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 31, 2023
@balu-ce

balu-ce commented May 4, 2023

Any update on this?

@btiernay
Contributor

Duplicate of #625

@AlexzSouz

Duplicate of #625

Both issues DO NOT have a solution for the problem @ialidzhikov mentioned and that I'm currently experiencing. The "duplicate" issue you (@btiernay) shared only contains comments from @k8s-triage-robot. No solution is provided 🤷

Any solution so far?

@alazyer

alazyer commented Dec 29, 2023

How about trying the journald plugin instead? It works fine for me to detect "NodeOOM" and "PodOOM" with the patterns ".*Out of memory.*" and ".*Memory cgroup out of memory.*".
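For reference, a sketch of what such a system-log-monitor config might look like with the journald plugin. The field names mirror the kernel-monitor.json shown in the error above; the reasons (NodeOOM, PodOOM) and patterns come from this comment, and the remaining values are assumptions to adapt:

```json
{
  "plugin": "journald",
  "pluginConfig": {
    "source": "kernel"
  },
  "logPath": "/var/log/journal",
  "lookback": "5m",
  "bufferSize": 10,
  "source": "kernel-monitor",
  "conditions": [],
  "rules": [
    {"type": "temporary", "reason": "NodeOOM", "pattern": ".*Out of memory.*"},
    {"type": "temporary", "reason": "PodOOM", "pattern": ".*Memory cgroup out of memory.*"}
  ]
}
```

Note the container still needs read access to the journal directory, e.g. a hostPath mount of /var/log/journal, but not /dev/kmsg itself.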

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 28, 2024
@wangzhen127
Member

NPD's goal is to detect infra-layer issues, so it needs to read logs in places where non-privileged containers do not have permission. Additionally, we use the health checker in production to repair kubelet and containerd by killing them. Those actions need privileges.

Depending on how you would like to use NPD, there may be a chance that you can tune your DaemonSet YAML to work without privileged access. @hakman for kops, does it run NPD in non-privileged mode?
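As a starting point for such tuning, one unverified sketch of a non-privileged securityContext (the capability names here are assumptions; note the original report says adding capabilities alone did not help for /dev/kmsg, likely because the runtime's device cgroup still blocks the device, so this may only suffice for journald- or file-based monitors):

```yaml
securityContext:
  privileged: false
  capabilities:
    add: ["SYSLOG", "DAC_READ_SEARCH"]
```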

@wangzhen127
Member

/remove-kind bug

@k8s-ci-robot k8s-ci-robot added needs-kind Indicates a PR lacks a `kind/foo` label and requires one. and removed kind/bug Categorizes issue or PR as related to a bug. labels Apr 5, 2024
@wangzhen127
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 5, 2024
@haardm

haardm commented Jun 8, 2024

Hello, I am also facing a similar issue reading from /dev/kmsg with NPD when my container is not given privileged mode. Is there any workaround? We only need to read; there are no mutating actions on our side.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 6, 2024