
The gcsfuse sidecar container does not exit automatically when all the containers have exited and the Pod restartPolicy is Never #23

Closed
songjiaxun opened this issue May 10, 2023 · 7 comments


@songjiaxun
Collaborator

songjiaxun commented May 10, 2023

In some use cases where the Pod restartPolicy is Never, users expect the sidecar container to exit automatically once all the other containers have exited.

For example, in Airflow use cases, the Task never completes and the DAG is blocked if the sidecar container does not exit automatically.

Note that we do support sidecar container auto-exit in Jobs -- the sidecar container will exit automatically when all the containers have exited in a Job Pod.

@songjiaxun
Collaborator Author

The issue is fixed by the commit 8a1a860.

The fix will be included in the next release.

@songjiaxun
Collaborator Author

This is fixed in the CSI driver v0.1.3 release.

@mescanne

@songjiaxun -- I'm a bit confused by this. It seems that, for example with Airflow, it still adds a full termination grace period to the end of every KubernetesPodOperator run.

My understanding from Kubernetes (based on this):

  • Termination Grace Period is the period of time after delivering the TERM signal.
  • If the process exits prior to the end of the grace period, then everything is finished at that point and you're done. If it hasn't, then it is forcibly KILLed and you're done more directly.

The Termination Grace Period is the period of time that the processes are allowed to exit gracefully -- if they don't, they're terminated abruptly. If they finish gracefully before then, that's great, too.

My understanding is that the gcsfuse sidecar uses the same parameter but with very different behaviour:

  • Once all of the processes are terminated (and the pod isn't restarting, or it is a job), it first waits out the grace period. This gives the gcsfuse sidecar time to do work, but it isn't notified to shut down.
  • It then sends a TERM signal to the gcsfuse sidecar and waits for it to exit gracefully.

The termination grace period is thus a fixed period of time to wait (needed or not) just in case gcsfuse hasn't synced everything.

It looks like GCS Fuse doesn't handle TERM, but it does handle INT with the correct behaviour (it unmounts and exits). (See this and this)
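
For reference, this is roughly the interrupt-then-unmount pattern being described; a minimal Go sketch, not the actual gcsfuse source, with the mount point path as a placeholder:

```go
// Minimal sketch (not the actual gcsfuse source) of reacting to SIGINT by
// unmounting and exiting cleanly. The mount point path is a placeholder.
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	mountPoint := "/mnt/gcs" // hypothetical mount point

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT) // reacts to INT, not TERM, as noted above

	<-sigs // block until an interrupt arrives

	if err := syscall.Unmount(mountPoint, 0); err != nil {
		log.Fatalf("unmount failed: %v", err)
	}
}
```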

Based on the above, I think the ideal behaviour would be:

  • Once all of the processes are terminated (and the pod isn't restarting, or it is a job), it immediately sends an interrupt signal to gcsfuse and waits up to graceTerminationPeriod for it to exit.
  • If it exits early, then great -- you're done. If it doesn't exit by graceTerminationPeriod, then it is killed forcibly.

Given the code here, I think all that is required is delivering an interrupt signal to all of the processes prior to doing the sleep. The process should exit normally when all child processes are done, I believe.

@songjiaxun
Collaborator Author

Thanks, @mescanne, for the question. So there are two scenarios actually.

First scenario

If the Pod is a Job Pod or the RestartPolicy is Never, the workflow is as follows:

  1. Once all the workload containers besides the sidecar container in the Pod have terminated, and if the Pod is a Job Pod or the RestartPolicy is Never, the CSI driver puts an "exit file" into the sidecar container's emptyDir volume to ask it to terminate.
  2. When the sidecar container detects the "exit file" in its emptyDir volume, it sleeps for 30 seconds. This sleep is currently hard-coded and not configurable.
  3. After the sleep, the sidecar container sends a SIGTERM signal to all the gcsfuse processes to kill them immediately.
  4. Lastly, the sidecar container terminates, which moves the Pod to a terminated status.

As you can see, in this scenario the terminationGracePeriod is not respected. This logic is actually a workaround until the Kubernetes sidecar container feature is available; that feature will handle sidecar container auto-termination natively.
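
A hedged sketch of that first-scenario flow; the exit file path, the PID list, and the function name are all hypothetical, not the driver's real values:

```go
// Hypothetical sketch of the first scenario described above: wait for the
// "exit file" the CSI driver writes into the shared emptyDir, sleep the
// hard-coded 30 seconds, then SIGTERM the gcsfuse processes.
package sidecar

import (
	"os"
	"syscall"
	"time"
)

const exitFilePath = "/tmp/gcsfuse-tmp/exit" // placeholder path

func waitForExitFileAndTerminate(gcsfusePids []int) {
	// Step 1: poll until the CSI driver drops the exit file into the emptyDir volume.
	for {
		if _, err := os.Stat(exitFilePath); err == nil {
			break
		}
		time.Sleep(time.Second)
	}

	// Step 2: hard-coded 30-second sleep, currently not configurable.
	time.Sleep(30 * time.Second)

	// Step 3: SIGTERM every gcsfuse process; the sidecar then exits (step 4).
	for _, pid := range gcsfusePids {
		_ = syscall.Kill(pid, syscall.SIGTERM)
	}
}
```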

Second scenario

For other workloads, the Pods are supposed to run forever. In this case, if the Pod crashes, the flow follows the doc Kubernetes best practices: terminating with grace. Specifically:

  1. When the Pod crashes, a SIGTERM signal is sent to the Pod.
  2. The sidecar container captures the SIGTERM signal and sleeps for the terminationGracePeriod.
  3. After the terminationGracePeriod has passed, a SIGKILL signal is sent to the Pod.
  4. All the containers are then forcefully killed.
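
A minimal sketch of this sequence from the sidecar's point of view; the 30-second value is an assumed default, and how the real sidecar obtains the grace period is not shown:

```go
// Rough illustration of the second scenario: the sidecar traps SIGTERM and
// simply waits out the grace period so gcsfuse keeps serving other containers
// until the kubelet's SIGKILL arrives.
package main

import (
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	terminationGracePeriod := 30 * time.Second // assumed default from the Pod spec

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)

	<-sigs                             // step 1: SIGTERM delivered to the Pod
	time.Sleep(terminationGracePeriod) // step 2: sleep through the grace period
	// Steps 3-4: after the grace period the kubelet sends SIGKILL and all
	// containers are forcefully killed.
}
```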

I hope the explanation is helpful.

@mescanne

Thanks @songjiaxun !

Waiting 30 seconds at the end of a Job or Pod (with RestartPolicy Never) is quite a big pain point, as it imposes a severe delay on every Airflow job/task that runs.

GKE 1.28 availability
Do you know when Kubernetes 1.28 will be available in GKE stable? I see it is available in rapid, but with some caveats around sidecar containers. So it's definitely close.

One note -- the gcs-fuse process itself doesn't seem to honour SIGTERM correctly, only SIGINT, and it will need to be modified as well.

If it's a few months or less then it's better to wait, but otherwise I have a proposal for improving the 30-second wait in the first scenario with the exit file.

Proposal
Specifically for the first scenario, when handling the exit file, I was thinking a dramatic improvement -- prior to the Kubernetes sidecar container feature -- would be to amend the logic here as follows upon detection of the exit file (a rough sketch follows below):

  • Deliver a SIGINT to all of the running commands.
  • Wait until either the waitgroup is signalled or the grace period (30 seconds) expires.
  • If the grace period expires without the waitgroup, then deliver a forcible process Kill() (as per the current behaviour).
  • Always deliver the SIGTERM signal for the sidecar container to clean up.

Why?

  • Give the gcs-fuse interrupt handler time to unmount cleanly and exit early. If this happens, you do not need to wait out the full 30 seconds.
  • Revert to the current behaviour if it takes the full grace period.
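
A rough sketch of the amended exit-file handling described above (all names here are hypothetical, not the driver's actual code):

```go
// Rough sketch of the proposed change: interrupt the gcsfuse child processes,
// wait up to the grace period for the waitgroup that tracks them, and only
// kill forcibly if the period expires.
package sidecar

import (
	"os"
	"sync"
	"time"
)

func terminateGcsfuse(procs []*os.Process, wg *sync.WaitGroup, gracePeriod time.Duration) {
	// Ask gcsfuse to unmount cleanly; it handles SIGINT as noted above.
	for _, p := range procs {
		_ = p.Signal(os.Interrupt)
	}

	done := make(chan struct{})
	go func() {
		wg.Wait() // wg is assumed to be decremented as each gcsfuse command exits
		close(done)
	}()

	select {
	case <-done:
		// All gcsfuse processes exited early: no need to sit out the full 30s.
	case <-time.After(gracePeriod):
		// Grace period expired: fall back to the forcible kill, as today.
		for _, p := range procs {
			_ = p.Kill()
		}
	}
}
```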

Code

  • I've started on some prototype code and, more importantly, some integration tests to test the behaviour. If the 1.28 sidecar container behaviour is imminent, then I'll hold off.

@songjiaxun
Collaborator Author

Hi @mescanne , I created this commit to make the sidecar container respect the Pod terminationGracePeriod in the first scenario.

In your use case, you will need to specify a small terminationGracePeriodSeconds value on your Pod, e.g. 5, or even 0. Otherwise, the default value is 30 seconds. I hope this improvement will be helpful.
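
For illustration only, a hedged sketch of such a Pod expressed with the k8s.io/api/core/v1 types (names and the image are placeholders; in practice you would set the equivalent field in your Pod manifest):

```go
// Hedged sketch: a Pod with RestartPolicy Never and a short
// terminationGracePeriodSeconds so the sidecar exits quickly once the
// workload finishes. Names and the image are placeholders.
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func examplePod() *corev1.Pod {
	grace := int64(5) // small grace period, per the suggestion above
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "gcsfuse-task"},
		Spec: corev1.PodSpec{
			RestartPolicy:                 corev1.RestartPolicyNever,
			TerminationGracePeriodSeconds: &grace,
			Containers: []corev1.Container{
				{Name: "workload", Image: "busybox", Command: []string{"true"}},
			},
		},
	}
}
```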

This will be included in the next release.

Also, the sidecar container feature is going to be promoted to beta in Kubernetes 1.29, which will be the Kubernetes version where the feature is enabled on GKE.

@mescanne

Thanks, that is very helpful. In this case we can decrease the graceful termination period. Thanks!
