
[bitnami/redis] prevent zombie PIDs in redis health checks #3559

Merged (2 commits) on Sep 1, 2020

Conversation

kabakaev
Contributor

Description of the change

Though PR #2453 decreased the probability of zombie PID generation, the problem still exists; see Steps to reproduce below.

The same issue probably exists in the redis-cluster chart. Let me know if a PR for redis-cluster is needed.

This PR fixes #2441 once and for all.

Benefits

No matter what a user writes into the readiness or liveness probes, zombie PIDs will no longer accumulate. The default settings will also stop generating zombies randomly.

Possible drawbacks

None.

Applicable issues

  • Fixes #2441

Root cause

Both livenessProbe and readinessProbe are configured as an exec of a shell script. Each script has a configurable timeout parameter, and the same timeout value is used in the timeoutSeconds parameter of the probe spec. This creates a race condition: if the probe script takes longer to exit than the kubelet's Go runtime takes to schedule the timeout handler, the kubelet fires its probe timeout procedure and leaves the probe script behind as a zombie PID. From what I can see, this race happens more often on nodes with high CPU utilization.

This PR increases the timeoutSeconds parameter by 1 second, giving the probe script enough time to finish and let its exit code reach the kubelet.
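
To illustrate the relationship between the two timeouts, a rendered probe could end up looking roughly like the sketch below (the script path and the numbers are placeholders, not the chart's exact output):

    livenessProbe:
      exec:
        command:
          - sh
          - -c
          # the script enforces its own timeout (5 seconds here, a placeholder value)
          - /health/ping_liveness_local.sh 5
      periodSeconds: 5
      # one extra second so the script can exit and its exit code can reach
      # the kubelet before the kubelet's own timeout fires
      timeoutSeconds: 6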

The second commit also enables shareProcessNamespace pod option, which is the universal solution to the problem of zombie PID accumulation. Details are given below.

Steps to reproduce

Tested in kind.

  1. Install redis chart: helm upgrade test1 ./bitnami/redis --install --set=cluster.enabled=false,usePassword=false,persistence.enabled=false,metrics.enabled=false,master.livenessProbe.enabled=false

  2. Make sure the pod is running:

    kubectl get po
    #NAME                   READY   STATUS    RESTARTS   AGE
    #test1-redis-master-0   1/1     Running   0          5m1s
  3. Run watch ps inside the container:

    kubectl exec -ti test1-redis-master-0 -- bash -c "watch -n1 'ps auxwf'"
    #Every 1.0s: ps auxwf
    #
    #USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    #1001      2586  0.1  0.0   2988  2328 pts/0    Ss+  19:47   0:00 watch -n1 ps auxwf
    #1001      4446  0.0  0.0   2988   424 pts/0    S+   19:48   0:00  \_ watch -n1 ps auxwf
    #1001      4447  0.0  0.0   2384   764 pts/0    S+   19:48   0:00      \_ sh -c ps auxwf
    #1001      4448  0.0  0.0   7636  2768 pts/0    R+   19:48   0:00          \_ ps auxwf
    #1001         1  0.1  0.0  52792  7752 ?        Ssl  19:46   0:01 redis-server *:6379
  4. Find the redis-server process on the host and pause it:

    # run in another terminal, outside any container namespace
    ps aux | grep -E 'redis-server|COMMAND'
    #USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    #1001     30042  0.1  0.0  52792  7672 ?        Ssl  19:46   0:01 redis-server *:6379
    
    kill -STOP 30042
  5. Look back at the ps watch inside the container to see the zombie PIDs ("defunct").

  6. "Unpause" redis: kill -CONT 30042.

  7. The zombie PIDs will remain.

    1001         1  0.1  0.0  52792  7672 ?        Ssl  19:46   0:01 redis-server *:6379
    1001       599  0.0  0.0      0     0 ?        Z    19:48   0:00 [ping_readiness_] <defunct>
    1001       612  0.0  0.0      0     0 ?        Z    19:48   0:00 [ping_liveness_l] <defunct>
    

Why even bother with zombie PIDs?

Though it may seem unimportant, a high number of zombies can lead to all sorts of problems, from the inability to make a Redis snapshot to dead Kubernetes daemons. With health probes running every 5 seconds, it is quite easy to reach a pod's PID limit.
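
For a rough sense of scale (assuming, purely for illustration, a pod PID limit of 1024, for example one configured via the kubelet's podPidsLimit option): two probes firing every 5 seconds can leak up to 2 zombies per period in the worst case, i.e. about 24 per minute, so such a limit could be exhausted in well under an hour.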

Shared process namespace is good

In Kubernetes, the shareProcessNamespace: true option has a useful side effect: all containers of the pod run under a proper init process.
Specifically, the first process of the pod gets PID=1 (init) and is also visible as PID=1 in all other containers. In Kubernetes, the first pod process is always /pause from the pause container (part of the network setup). This mighty binary reaps orphaned processes, reading their exit codes and thus preventing zombies from exhausting the pod/node PID limit.
Such a shared PID namespace mode was enabled by default in k8s@docker for some time, but with the introduction of the shareProcessNamespace option, the shared PID namespace was switched back to disabled by default.
Thus, an easy solution is to enable shareProcessNamespace on all pods that use an exec health check.
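
For illustration, a minimal pod spec fragment with the option enabled might look like this (the name and image tag are placeholders, not taken from the chart's templates):

    apiVersion: v1
    kind: Pod
    metadata:
      name: redis-master-example   # placeholder name
    spec:
      # /pause becomes PID 1 of the shared PID namespace and reaps orphaned
      # probe-script processes, so zombies cannot pile up
      shareProcessNamespace: true
      containers:
        - name: redis
          image: bitnami/redis:latest   # placeholder image tag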

Otherwise, in order to avoid leaving zombies behind, each exec script must be written to finish in a guaranteed time shorter than the k8s probe timeoutSeconds parameter. This is nearly impossible to do for user-defined health checks.
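
As a rough sketch of that rule for a user-defined check (the command and numbers are placeholders, not something the chart generates):

    readinessProbe:
      exec:
        command:
          - sh
          - -c
          # bound the check itself to 5 seconds so its exit code is reported
          # before the kubelet's timeout fires
          - timeout 5 redis-cli -h localhost -p 6379 ping | grep -q PONG
      # kept strictly larger than the inner timeout
      timeoutSeconds: 6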

Checklist

  • Chart version bumped in Chart.yaml according to semver.
  • Variables are documented in the README.md
  • Title of the PR starts with chart name (e.g. [bitnami/chart])
  • If the chart contains a values-production.yaml apart from values.yaml, ensure that you implement the changes in both files

@dani8art
Contributor

Hi @kabakaev, thanks for opening this PR!

I think it is a good approach to have a shorter timeout in the probe script than in the probe spec, so the probe script can finish safely and we avoid the zombie PIDs. I just have a question regarding this: what would be the best gap between them? Would 1 second be enough? I mean, it could depend on the CPU usage of the cluster, right? Wouldn't it be good to add a new property to let the user configure it?

Regarding sharedProcessNamespace, could it bring back some security issues? We would like to keep the namespaces separated; if the timeout approach already fixes the PID problem, I would keep it at the Kubernetes default.

@kabakaev
Contributor Author

@dani8art, thanks for the quick review!

I think it is a good approach to have a shorter timeout in the probe script than in the probe spec. (...) Wouldn't it be good to add a new property to let the user configure it?

The scripts are already written in a way that guarantees the probe ends in exactly the time specified, maybe only a few milliseconds longer than configured. Thus, it is not the timeoutSeconds parameter of the Kubernetes object spec that defines the exact timeout, but the value passed as the first parameter to the health check script.
In fact, timeoutSeconds in the pod spec can have any integer value, provided it is higher than the value passed to the health check script. IMHO, no separate parameter is required, because the point of timeoutSeconds here is not to set the pod parameters, but to guarantee the timeout is small enough to prevent many simultaneous health checks.

Moreover, setting the timeout to timeoutSeconds - 1 would not be transparent. Think of master.readinessProbe.timeoutSeconds=1 (the current default). I think the comment in the generated YAML is enough, though we may copy it over to all places where the add1() function is used. HDYT?
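
For illustration, the idea could be expressed in a template roughly like this (a sketch with a placeholder script path, not the chart's actual template):

    livenessProbe:
      exec:
        command:
          - sh
          - -c
          # the script receives the user-configured timeout as its first argument
          - /health/ping_liveness_local.sh {{ .Values.master.livenessProbe.timeoutSeconds }}
      # add1 (a Sprig function available in Helm templates) renders the value plus one second
      timeoutSeconds: {{ add1 .Values.master.livenessProbe.timeoutSeconds }}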

Regarding sharedProcessNamespace, could it bring back some security issues?

I'm not aware of any security issues with a shared process namespace. In fact, a shared PID namespace was the default setting for kubernetes@docker for a while.

Regarding sharedProcessNamespace, (...) I would keep it at the Kubernetes default.

I'm fine with that. The first commit already fixes the issue for the default settings of the liveness/readiness probes.
Still, it may be helpful to be able to switch on the master.sharedProcessNamespace or slave.sharedProcessNamespace if customReadinessProbe or customLivenessProbe is used.
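
For example, such a combination in values.yaml might look roughly like this (the value name follows this discussion and may differ in the final chart; the probe command is a placeholder):

    master:
      sharedProcessNamespace: true    # off by default; enabled here because a custom probe is used
      customLivenessProbe:
        exec:
          command:
            - sh
            - -c
            - /scripts/my-custom-check.sh   # hypothetical user-provided script
        timeoutSeconds: 5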

I've updated the second commit accordingly.

What about bitnami/redis-cluster? Personally, I do not use it. Should we leave it broken?

@dani8art
Contributor

dani8art commented Sep 1, 2020

What about bitnami/redis-cluster? Personally, I do not use it. Should we leave it broken?

I think we should tackle it as well; are you able to open a PR too?

Still, it may be helpful to be able to switch on the master.sharedProcessNamespace or slave.sharedProcessNamespace if customReadinessProbe or customLivenessProbe is used.

That's right, but could you set it to false by default in values.yaml, as that is the default behavior in k8s now, please?

Regarding the timeout, it is OK with me; let's see how it works and whether there is any issue with it. In case someone comes up with an issue, we will change to a separate parameter.

Thank you so much for the detailed explanation.

@kabakaev
Contributor Author

kabakaev commented Sep 1, 2020

could you set it to false by default in values.yaml

I said that it was switched to false, but forgot to stage the changes before git commit --amend yesterday :)

The change is pushed properly now.

@kabakaev
Contributor Author

kabakaev commented Sep 1, 2020

are you able to open a PR too?

Sure, will do.

Contributor

@dani8art left a comment


LGTM!
Just a small change.

Thank you so much!

bitnami/redis/Chart.yaml (review comment, resolved)
Successfully merging this pull request may close these issues.

zombie processes caused by health checks