
[bitnami/redis] prevent zombie PIDs in redis health checks #3559

Merged (2 commits) on Sep 1, 2020

Conversation

kabakaev
Contributor

Description of the change

Though PR #2453 decreased the probability of zombie PID generation, the problem still exists; see Steps to reproduce below.

The same issue probably exists in the redis-cluster chart. Let me know if a PR for redis-cluster is needed.

This PR fixes #2441 once and for all.

Benefits

No matter what a user writes into the readiness or liveness probes, zombie PIDs will no longer accumulate. The default settings will also stop generating zombies randomly.

Possible drawbacks

None.

Applicable issues

  • Fixes #2441

Root cause

Both livenessProbe and readinessProbe are configured as an exec of a shell script. Each script has a configurable timeout parameter, and the same timeout value is used in the timeoutSeconds parameter of the probe spec. This creates a race condition: if the probe script takes longer to exit than the kubelet's Go runtime takes to schedule the timeout handler, the kubelet fires its probe timeout procedure and leaves the probe script behind as a zombie PID. From what I can see, this race happens more often on nodes with high CPU utilization.

This PR increases the timeoutSeconds parameter by 1 second, giving the probe script enough time to finish and let its exit code reach the kubelet.
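
To illustrate the relationship between the two timeouts, a rendered probe could end up looking roughly like the sketch below (the script path and the numbers are placeholders, not the chart's exact output):

    livenessProbe:
      exec:
        command:
          - sh
          - -c
          # the script enforces its own timeout (5 seconds here, a placeholder value)
          - /health/ping_liveness_local.sh 5
      periodSeconds: 5
      # one extra second so the script can exit and its exit code can reach
      # the kubelet before the kubelet's own timeout fires
      timeoutSeconds: 6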

The second commit also enables shareProcessNamespace pod option, which is the universal solution to the problem of zombie PID accumulation. Details are given below.

Steps to reproduce

Tested in kind.

  1. Install redis chart: helm upgrade test1 ./bitnami/redis --install --set=cluster.enabled=false,usePassword=false,persistence.enabled=false,metrics.enabled=false,master.livenessProbe.enabled=false

  2. Make sure the pod is running:

    kubectl get po
    #NAME                   READY   STATUS    RESTARTS   AGE
    #test1-redis-master-0   1/1     Running   0          5m1s
  3. Run watch ps inside the container:

    kubectl exec -ti test1-redis-master-0 -- bash -c "watch -n1 'ps auxwf'"
    #Every 1.0s: ps auxwf
    #
    #USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    #1001      2586  0.1  0.0   2988  2328 pts/0    Ss+  19:47   0:00 watch -n1 ps auxwf
    #1001      4446  0.0  0.0   2988   424 pts/0    S+   19:48   0:00  \_ watch -n1 ps auxwf
    #1001      4447  0.0  0.0   2384   764 pts/0    S+   19:48   0:00      \_ sh -c ps auxwf
    #1001      4448  0.0  0.0   7636  2768 pts/0    R+   19:48   0:00          \_ ps auxwf
    #1001         1  0.1  0.0  52792  7752 ?        Ssl  19:46   0:01 redis-server *:6379
  4. Find the redis-server process on the host and pause it:

    # run in another terminal, outside any container namespace
    ps aux | grep -E 'redis-server|COMMAND'
    #USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    #1001     30042  0.1  0.0  52792  7672 ?        Ssl  19:46   0:01 redis-server *:6379
    
    kill -STOP 30042
  5. Look back at the ps watch inside the container to see the zombie PIDs ("defunct").

  6. "Unpause" redis: kill -CONT 30042.

  7. The zombie PIDs will remain.

    1001         1  0.1  0.0  52792  7672 ?        Ssl  19:46   0:01 redis-server *:6379
    1001       599  0.0  0.0      0     0 ?        Z    19:48   0:00 [ping_readiness_] <defunct>
    1001       612  0.0  0.0      0     0 ?        Z    19:48   0:00 [ping_liveness_l] <defunct>
    

Why even bother with zombie PIDs?

Though it may seem unimportant, a high number of zombies can lead to all sorts of problems, from the inability to make a Redis snapshot to dead Kubernetes daemons. With health probes running every 5 seconds, it is quite easy to reach a pod's PID limit.
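
For a rough sense of scale (assuming, purely for illustration, a pod PID limit of 1024, for example one configured via the kubelet's podPidsLimit option): two probes firing every 5 seconds can leak up to 2 zombies per period in the worst case, i.e. about 24 per minute, so such a limit could be exhausted in well under an hour.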

Shared process namespace is good

In Kubernetes, the shareProcessNamespace: true option has a useful side effect: all containers of the pod run under a proper init process.
Specifically, the first process of the pod gets PID=1 (init) and is also visible as PID=1 in all other containers. In Kubernetes, the first pod process is always /pause from the pause container (part of the network setup). This mighty binary reaps orphaned processes, reading their exit codes and thus preventing zombies from exhausting the pod/node PID limit.
Such a shared PID namespace mode was enabled by default in k8s@docker for some time, but with the introduction of the shareProcessNamespace option, the shared PID namespace was switched back to disabled by default.
Thus, an easy solution is to enable shareProcessNamespace on all pods that use an exec health check.
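
For illustration, a minimal pod spec fragment with the option enabled might look like this (the name and image tag are placeholders, not taken from the chart's templates):

    apiVersion: v1
    kind: Pod
    metadata:
      name: redis-master-example   # placeholder name
    spec:
      # /pause becomes PID 1 of the shared PID namespace and reaps orphaned
      # probe-script processes, so zombies cannot pile up
      shareProcessNamespace: true
      containers:
        - name: redis
          image: bitnami/redis:latest   # placeholder image tag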

Otherwise, in order to avoid leaving zombies behind, each exec script must be written to finish in a guaranteed time shorter than the k8s probe timeoutSeconds parameter. This is nearly impossible to do for user-defined health checks.
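
As a rough sketch of that rule for a user-defined check (the command and numbers are placeholders, not something the chart generates):

    readinessProbe:
      exec:
        command:
          - sh
          - -c
          # bound the check itself to 5 seconds so its exit code is reported
          # before the kubelet's timeout fires
          - timeout 5 redis-cli -h localhost -p 6379 ping | grep -q PONG
      # kept strictly larger than the inner timeout
      timeoutSeconds: 6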

Checklist

  • Chart version bumped in Chart.yaml according to semver.
  • Variables are documented in the README.md
  • Title of the PR starts with chart name (e.g. [bitnami/chart])
  • If the chart contains a values-production.yaml apart from values.yaml, ensure that you implement the changes in both files

@dani8art
Contributor

Hi @kabakaev, thanks for opening this PR!

I think it is a good approach to have a shorter timeout in the probe script than in the probe spec, so the probe script can finish safely and we avoid the zombie PIDs. I just have a question regarding this: what would be the best gap between them? Would 1 second be enough? I mean, it could depend on the CPU usage of the cluster, right? Wouldn't it be good to add a new property to let the user configure it?

Regarding sharedProcessNamespace, could it bring back some security issues? We would like to keep the namespaces separated; if the timeout approach already fixes the PID problem, I would keep it at the Kubernetes default.

@kabakaev
Contributor Author

@dani8art, thanks for the quick review!

I think it is a good approach to have a shorter timeout in the probe script than in the probe spec. (...) Wouldn't it be good to add a new property to let the user configure it?

The scripts are already written in a way that guarantees the probe ends in exactly the time specified, maybe only a few milliseconds longer than configured. Thus, it is not the timeoutSeconds parameter of the Kubernetes object spec that defines the exact timeout, but the value passed as the first parameter to the health check script.
In fact, timeoutSeconds in the pod spec can have any integer value, provided it is higher than the value passed to the health check script. IMHO, no separate parameter is required, because the point of timeoutSeconds here is not to set the pod parameters, but to guarantee the timeout is small enough to prevent many simultaneous health checks.

Moreover, setting the timeout to timeoutSeconds - 1 would not be transparent. Think of master.readinessProbe.timeoutSeconds=1 (the current default). I think the comment in the generated YAML is enough, though we may copy it over to all places where the add1() function is used. HDYT?
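
For illustration, the idea could be expressed in a template roughly like this (a sketch with a placeholder script path, not the chart's actual template):

    livenessProbe:
      exec:
        command:
          - sh
          - -c
          # the script receives the user-configured timeout as its first argument
          - /health/ping_liveness_local.sh {{ .Values.master.livenessProbe.timeoutSeconds }}
      # add1 (a Sprig function available in Helm templates) renders the value plus one second
      timeoutSeconds: {{ add1 .Values.master.livenessProbe.timeoutSeconds }}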

Regarding sharedProcessNamespace, could it bring back some security issues?

I'm not aware of any security issues with a shared process namespace. In fact, a shared PID namespace was the default setting for kubernetes@docker for a while.

Regarding sharedProcessNamespace, (...) I would keep it at the Kubernetes default.

I'm fine with that. The first commit already fixes the issue for the default settings of the liveness/readiness probes.
Still, it may be helpful to be able to switch on the master.sharedProcessNamespace or slave.sharedProcessNamespace if customReadinessProbe or customLivenessProbe is used.
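
For example, such a combination in values.yaml might look roughly like this (the value name follows this discussion and may differ in the final chart; the probe command is a placeholder):

    master:
      sharedProcessNamespace: true    # off by default; enabled here because a custom probe is used
      customLivenessProbe:
        exec:
          command:
            - sh
            - -c
            - /scripts/my-custom-check.sh   # hypothetical user-provided script
        timeoutSeconds: 5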

I've updated the second commit accordingly.

What about bitnami/redis-cluster? Personally, I do not use it. Should we leave it broken?

@dani8art
Contributor

dani8art commented Sep 1, 2020

What about bitnami/redis-cluster? Personally, I do not use it. Should we leave it broken?

I think we should tackle it as well; are you able to open a PR too?

Still, it may be helpful to be able to switch on the master.sharedProcessNamespace or slave.sharedProcessNamespace if customReadinessProbe or customLivenessProbe is used.

That's right, but could you set it to false by default in values.yaml, as that is the default behavior in k8s now, please?

Regarding the timeout, it is OK with me; let's see how it works and whether there is any issue with it. In case someone comes up with an issue, we will change to a separate parameter.

Thank you so much for the detailed explanation.

@kabakaev
Contributor Author

kabakaev commented Sep 1, 2020

could you set it to false by default in values.yaml

I said that it was switched to false, but forgot to stage the changes before git commit --amend yesterday :)

The change is pushed properly now.

@kabakaev
Contributor Author

kabakaev commented Sep 1, 2020

are you able to open a PR too?

Sure, will do.

Contributor

@dani8art left a comment


LGTM!
Just a small change.

Thank you so much!

bitnami/redis/Chart.yaml (review comment, resolved)
Successfully merging this pull request may close these issues.

zombie processes caused by health checks