
[Proposal] deferContainers - Pod Termination Semantics #483

Closed
wants to merge 3 commits into from

Conversation

dhilipkumars
Contributor

@dhilipkumars dhilipkumars commented Mar 25, 2017

Proposal for deferContainers, a complementary feature to init-containers that enables graceful pod termination.

@quinton-hoole @deepak-vij @shashidharatd @irfanurrehman @kevin-wangzefeng

@smarterclayton @Kargakis @erictune @kow3ns @hectorj2f @davidopp

It would be great to get your feedback on this; also, could someone tag sig-apps please?

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 25, 2017
@dhilipkumars dhilipkumars changed the title deferContainers [Proposal] deferContainers Mar 26, 2017
@0xmichalis
Contributor

@kubernetes/sig-apps-feature-requests @kubernetes/sig-node-feature-requests

@erictune
Member

erictune commented Apr 3, 2017

@kubernetes/sig-node-api-reviews would need to comment on this too since it would be implemented in the node.

## Caveat
This design primarily focuses on handling graceful termination cases. If the node running a pod configured with deferContainers
crashes abruptly, this design does not guarantee that cleanup is performed gracefully. We still need community
feedback on how such scenarios are handled today and how important it is for deferContainers to cover that situation.
Contributor

I think this is quite crucial. In the case of initContainers you have a guarantee: if the container is running, the init container was executed and it executed successfully. In the case of deferContainers we have no such guarantee. Not to mention fetching logs from deleted pods (no pod => no pod name in the case of controllers => no way to get deferContainer logs).

Probably all of that has already been mentioned elsewhere, but I wanted to raise it here as well.

Contributor Author

Honestly, we could not think of an elegant way to solve this problem. We really need the community's feedback on how this is handled today, but we believe that even without this guarantee deferContainers would be useful in many cases.

Contributor

I think adding this without any idea of how to solve that problem is a bad idea; it may create the illusion that certain functionality is provided, while in production it may not work as "expected" (compared with initContainers).

Contributor Author

Termination constructs are important, but providing the same absolute guarantees as their initialization counterparts is difficult to achieve, especially when there are ungraceful terminations like a node crash.

Consider the analogous constructs in programming languages: C++ destructors, Go's defer(), atexit in C. None of these is guaranteed to run if the process crashes (ungraceful termination), but that does not mean they are not useful; their initialization counterparts simply offer a higher level of guarantee.

We think the current design is a good starting point for providing better support for graceful Pod termination, and we might be able to improve it along the way as users try it and provide more feedback. The initial implementation will be alpha, so we should caution the community against trying this in production without enough testing.

What do you think?

@kfox1111 kfox1111 Mar 2, 2018

What about renaming it to eventContainer and passing in the event kind as an env var?
Something like POD_EVENT=stop, POD_EVENT=recover, etc. We only need to implement the first, stop, for now, but we can reuse the machinery later for things like informing a StatefulSet pod that it will be shut down but replaced. In Ceph's case, for example, the StatefulSet might want to set no-out during an upgrade since the pod will come right back, but not during a delete, since it might not come back and the cluster should recover. Kind of like how rpm spec install hooks work during remove/upgrade.
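To make the idea concrete, here is a minimal entrypoint sketch for such an eventContainer; POD_EVENT and its values are the hypothetical names from the comment above, not an existing Kubernetes contract.

package main

import (
	"fmt"
	"os"
)

func main() {
	// The kubelet would be expected to set POD_EVENT before running this container.
	switch event := os.Getenv("POD_EVENT"); event {
	case "stop":
		// e.g. rebalance/deregister because the pod is going away for good
		fmt.Println("handling stop event")
	case "recover":
		// e.g. set no-out because the pod is expected to come right back (upgrade)
		fmt.Println("handling recover event")
	default:
		fmt.Printf("unknown or unset POD_EVENT %q, nothing to do\n", event)
	}
}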


## Use Cases

### Cleanup:
Contributor

How are we going to handle this situation:

  • you want to delete a pod because it's broken or whatever
  • it has a deferContainer that will hang because of "some reasons"

How am I supposed to not wait for the deferContainer? Are we going to add kubectl delete .... --ignore-defer-containers?

Contributor Author

I was thinking we should probably force the delete; in that case we won't need to run deferContainers:

kubectl delete pod foo --grace-period=0 --force

But this new option is cool too; it separates out the functionality.

Member

Doesn't backwards compatibility mean that if someone previously ran delete pod foo --grace-period=0 to terminate immediately without waiting, that should skip deferContainers as well?

Contributor Author

Then it might force --grace-period onto all the other deferContainers; predicting the running time of a given cleanup/termination activity might be difficult. Or maybe we could have --grace-period=-1 for unlimited termination time for a Pod?

Contributor

How are we going to handle "force shutdown" vs. "force shutdown + ignore defer"? Please don't get me wrong, but it seems to be a bit inconsistent.

Contributor Author

Agreed. I now see the inconsistency; to keep it more predictable we should stick with

kubectl delete pod foo --grace-period=0 --force

as originally proposed. Even now this is the way to avoid running PreStop hooks.

Contributor

We'd have to come up with some other signal. --grace-period=1 means "delete fast" and --grace-period=0 means "give up strong consistency and risk split brain". You don't want to do the latter.

I'm not sure there is a backward compatible way to allow defer containers to run without being bounded in time by the grace-period. I'm also not sure we want defer containers to run unbounded, since that was a key part of the graceful deletion design.

Contributor Author

@smarterclayton I have added some design thoughts on how we handle time-boundedness for deferContainers; this was something I wanted to bring up in the SIG-Apps meeting yesterday. PTAL.

Also, this particular discussion is about deleting the Pod without running deferContainers or preStop hooks, the equivalent of kill -9 PID. Could you please elaborate on why we should come up with a different signal for this?


## Limitation with the current system
Container pre-stop hooks are not sufficient for all termination cases:
* They cannot easily coordinate complex conditions across containers in multi-container pods
Member

Can you provide a specific use case where the current Pod hooks are not sufficient?

Contributor Author

The container hooks are not generic to the Pod itself; they are specific to containers. Imagine a multi-container pod with 1 app container and 1 sidecar:
SideCar: Membership/Gossip Container
AppContainer: A Stateful Sharded workload
Termination sequence: first it needs to re-shard/rebalance, then come out of membership
The code below shows that both preStop hooks (if configured for both containers) will be started in parallel:

for _, container := range runningPod.Containers {
		go func(container *kubecontainer.Container) {
			defer utilruntime.HandleCrash()
			defer wg.Done()

			killContainerResult := kubecontainer.NewSyncResult(kubecontainer.KillContainer, container.Name)
			if err := m.killContainer(pod, container.ID, container.Name, "Need to kill Pod", gracePeriodOverride); err != nil {
				killContainerResult.Fail(kubecontainer.ErrKillContainer, err.Error())
			}
			containerResults <- killContainerResult
		}(container)
	}
  1. It is hard to coordinate if both have different preStop hooks.
  2. It is hard to consolidate this logic into one preStop script if both containers come from different images; upgrading the images individually becomes a problem.
  3. If only one preStop command fails, only that step should be retried, not the whole script (currently preStop hook restarts are not available; the user has to handle retries inside the script).
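To illustrate the contrast, here is a minimal, self-contained sketch of the ordered, per-step termination a pod-level construct could provide, where only the failed step is retried; the step names and the runSequence helper are hypothetical and are not kubelet code.

package main

import (
	"errors"
	"fmt"
)

type terminationStep struct {
	name string
	run  func() error
}

// runSequence executes the steps strictly in order and retries only the step
// that failed, instead of starting every container's hook in parallel.
func runSequence(steps []terminationStep, retries int) error {
	for _, s := range steps {
		var err error
		for attempt := 0; attempt <= retries; attempt++ {
			if err = s.run(); err == nil {
				break
			}
		}
		if err != nil {
			return fmt.Errorf("step %q failed: %w", s.name, err)
		}
	}
	return nil
}

func main() {
	steps := []terminationStep{
		{name: "rebalance-shards", run: func() error { return nil }},
		{name: "leave-membership", run: func() error { return errors.New("gossip peer busy") }},
	}
	fmt.Println(runSequence(steps, 2))
}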

## Limitation with the current system
Container pre-stop hooks are not sufficient for all termination cases:
* They cannot easily coordinate complex conditions across containers in multi-container pods
* They can only function with code in the image or code in a shared volume, which would have to be statically linked
Member

When would I want to pull down new code to turn down a storage system? If the primary use cases are those expressed above, isn't all the code in my container already?

Contributor Author

The objective is not always to pull a new image; the objective is to separate out/decouple the termination sequence. Most often we won't have to pull a new image at all. I can also imagine simple cleanups like 'rm -rf /tmp/dir1' wrapped around bash or busybox images, which are lighter.

If your pod is part of a CI engine and running thousands of test cases, it should be easy to configure a deferContainer (from a different image/source) that simply uploads (curl POST) the test results processed so far to a different server when the Pod is getting evicted from that node.
If curl is not available in your primary container, you don't have to repackage and create a new image just to run curl at termination, which would be the case with preStop hooks.

* They cannot easily coordinate complex conditions across containers in multi-container pods
* They can only function with code in the image or code in a shared volume, which would have to be statically linked
(not a common pattern in wide use)
* They cannot be implemented with the current Docker implementation
Member

Can you please elaborate on what is meant by this?

Contributor Author

Currently, Docker does not allow us to configure any termination hooks. I think it's better to rephrase or remove this point to avoid confusion.

@kow3ns
Member

kow3ns commented Apr 3, 2017

If I understand your design, it seems you want to move to a model where we run deferContainers during the terminationGracePeriod seconds to provide a more eloquent abstraction than handling cleanup from the container's entry point.

  1. Other than being "more eloquent," isn't this functionally equivalent to what we can do now, with the exception that you could potentially pull down new code in your deferContainer?
  2. When would you need to pull down a new image to clean up an existing image? I think it's a little dangerous to not package your cleanup code in the same container as the application. If you have an image pull failure for your defer container, you will not be able to clean up successfully.
  3. I would think running commands in the same container to clean up a stateful application would be a more beneficial abstraction than pulling down a new image.
  4. deferContainers will make it more difficult to reason about terminationGracePeriods. You will now have to budget image pull time into the grace period. This can be highly variable based on the environment.

@dhilipkumars
Contributor Author

1. Other than being "more eloquent," isn't this functionally equivalent to what we can do now, with the exception that you could potentially pull down new code in your deferContainer?

Yes: it decouples the termination logic, handles complex termination sequences, and lets you upgrade/change the primary container images without affecting the initialization or termination scripts. More predictable termination behaviour.

2. When would you need to pull down a new image to clean up an existing image? I think it's a little dangerous to not package your cleanup code in the same container as the application. If you have an image pull failure for your defer container, you will not be able to clean up successfully.

It is not always necessary to pull a new image for cleanup; most cleanups might use the same image, just in a more predictable way. It also modularizes the termination sequence and makes it possible to restart only the failed steps. Currently, if the preStop hook (the only way to trigger cleanup) fails, nothing much can be done. Imagine a master/slave type workload where a terminating master is promoting a slave: if that command fails, it is important to retry it, and today the user has to implement the logic to handle this. The code below doesn't attempt to retry preStop hooks if they fail.

// executePreStopHook runs the pre-stop lifecycle hooks if applicable and returns the duration it takes.
func (m *kubeGenericRuntimeManager) executePreStopHook(pod *v1.Pod, containerID kubecontainer.ContainerID, containerSpec *v1.Container, gracePeriod int64) int64 {
	glog.V(3).Infof("Running preStop hook for container %q", containerID.String())

	start := metav1.Now()
	done := make(chan struct{})
	go func() {
		defer close(done)
		defer utilruntime.HandleCrash()
		if msg, err := m.runner.Run(containerID, pod, containerSpec, containerSpec.Lifecycle.PreStop); err != nil {
			glog.Errorf("preStop hook for container %q failed: %v", containerSpec.Name, err)
			m.generateContainerEvent(containerID, v1.EventTypeWarning, events.FailedPreStopHook, msg)
		}
	}()

3. I would think running commands in the same container to clean up a stateful application would be a more beneficial abstraction than pulling down a new image.

deferContainers would allow us to run an arbitrary set of containers during termination (preStop) for the entire pod. If one develops an operator for a particular TRP workload that has a reasonable termination sequence, one can use the primary image (like redis or oracle) directly from the source instead of repackaging that image with the termination scripts; the termination scripts/sequences can be improved and the primary image can be upgraded without affecting each other.

4. deferContainers will make it more difficult to reason about terminationGracePeriods. You will now have to budget image pull time into the grace period. This can be highly variable based on the environment.

I was thinking we should add a separate flag to handle such rare cases where one would want to use a different image which is very heavy,

prePullDeferImages: true

such that as soon as the Pod is started the kubelet can download and pre-populate these heavy images on the node?
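As a rough illustration of that intent (not kubelet code), a hypothetical pre-pull step could walk the defer container images at pod startup and warm the image cache, so image pulls do not eat into the termination grace period later; imagePuller and prePullDeferImages are made-up names.

package main

import "fmt"

// imagePuller is a hypothetical stand-in for whatever mechanism the kubelet
// uses to pull images.
type imagePuller interface {
	Pull(image string) error
}

type logPuller struct{}

func (logPuller) Pull(image string) error {
	fmt.Println("pre-pulling", image)
	return nil
}

// prePullDeferImages warms the cache for every defer container image at pod
// startup when the (hypothetical) prePullDeferImages flag is set.
func prePullDeferImages(enabled bool, deferImages []string, p imagePuller) error {
	if !enabled {
		return nil
	}
	for _, img := range deferImages {
		if err := p.Pull(img); err != nil {
			return fmt.Errorf("pre-pull of %s failed: %w", img, err)
		}
	}
	return nil
}

func main() {
	_ = prePullDeferImages(true, []string{"busybox", "example.org/cleanup"}, logPuller{})
}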

@0xmichalis
Contributor

if the preStop hook (the only way to trigger cleanup) fails, nothing much can be done. Imagine a master/slave type workload where a terminating master is promoting a slave: if that command fails, it is important to retry it, and today the user has to implement the logic to handle this. The code below doesn't attempt to retry preStop hooks if they fail.

Can't we fix pre-stop hooks to be retriable?

@dhilipkumars
Contributor Author

Can't we fix pre-stop hooks to be retriable?

Yes, in my opinion that would make preStop hooks more functional, but the termination script would still be tightly coupled with the app image, and these are container-specific hooks.
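For reference, here is a minimal, self-contained sketch of what bounded retries around a single termination step could look like; runStep is a hypothetical stand-in for the hook runner, not the kubelet's actual API.

package main

import (
	"errors"
	"fmt"
	"time"
)

// runStep is hypothetical; imagine it executing one pre-stop or defer step.
func runStep(attempt int) error {
	if attempt < 3 {
		return errors.New("transient failure")
	}
	return nil
}

// retryStep keeps retrying a single failed step until it succeeds or the
// remaining grace budget is exhausted.
func retryStep(budget time.Duration) error {
	deadline := time.Now().Add(budget)
	for attempt := 1; ; attempt++ {
		err := runStep(attempt)
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("giving up after %d attempts: %w", attempt, err)
		}
		time.Sleep(time.Second) // simple fixed backoff for the sketch
	}
}

func main() {
	fmt.Println(retryStep(10 * time.Second))
}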

@stp-ip
Member

stp-ip commented Apr 7, 2017

With this proposal we definitely make termination logic more descriptive and extensive. I do like the syntactic ideas as they are similar to init containers for startup logic.

The main issue I have with this proposal is that it adds yet another way to handle something. If deferContainers are accepted, we should think about "deprecating" preStop hooks, or at least advising against them.

@smarterclayton
Contributor

smarterclayton commented Apr 10, 2017 via email

@dhilipkumars dhilipkumars changed the title [Proposal] deferContainers [Proposal] deferContainers - Pod Termination Constructs Apr 13, 2017
@dhilipkumars dhilipkumars changed the title [Proposal] deferContainers - Pod Termination Constructs [Proposal] deferContainers - Pod Termination Semantics Apr 14, 2017
@dhilipkumars
Contributor Author

Ref kubernetes/kubernetes#3585

Update: More information about TerminationGracePeriod and a technique for pre-populating heavy images before the termination phase.
Update design with more info and remove irrelevant portions.
@lukasheinrich

lukasheinrich commented Apr 4, 2018

Just joining the discussion to add a use case: kubernetes/kubernetes#62117

For big-data analysis scenarios (such as high-energy physics), the data-management concerns and the actual data-processing concerns are handled by different teams, so separating them is very useful. The sequence we would like to implement as part of a k8s Job object is:

  • initContainer: copy data from object storage onto a local, fast-I/O disk attached as a dynamically provisioned volume
  • main container: a FROM scratch container consisting only of a binary that can operate on the local data via POSIX, without needing support for, knowledge of, or credentials to the object store from which the data originated
  • deferContainer: a container that takes the output files written to the local disk by the main container and uploads them back to object storage

Crucially, this would allow the data-management teams to independently develop their stage-in/stage-out containers, including authn/authz, while keeping the main workload container as lean as possible (in the extreme case a statically compiled binary packaged as FROM scratch) and oblivious to authentication/credentials, etc.
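As an illustration only, here is a small Go sketch of that pod layout; deferPodSpec and the image names are hypothetical, since the proposed deferContainers field does not exist in the real corev1.PodSpec.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// deferPodSpec is a hypothetical wrapper showing where the proposed
// deferContainers list would sit next to the existing fields.
type deferPodSpec struct {
	corev1.PodSpec
	DeferContainers []corev1.Container
}

func stageInStageOutPod() deferPodSpec {
	return deferPodSpec{
		PodSpec: corev1.PodSpec{
			InitContainers: []corev1.Container{{
				Name:  "stage-in",
				Image: "example.org/data-mgmt/stage-in", // hypothetical image: copies input data from object storage to the local volume
			}},
			Containers: []corev1.Container{{
				Name:  "analysis",
				Image: "example.org/physics/analysis", // hypothetical image: FROM scratch binary operating on local data via POSIX
			}},
		},
		DeferContainers: []corev1.Container{{
			Name:  "stage-out",
			Image: "example.org/data-mgmt/stage-out", // hypothetical image: uploads output files back to object storage at termination
		}},
	}
}

func main() {
	p := stageInStageOutPod()
	fmt.Println(len(p.InitContainers), len(p.Containers), len(p.DeferContainers))
}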

@lukasheinrich

Hi, I discussed this issue with a couple of people at KubeCon; how can we move this forward?

@CaoShuFeng
Contributor

/cc

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 24, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 23, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@lukasheinrich

/reopen I think this is still an interesting use case

@idealhack
Member

/reopen

I'm not sure if we need to translate this proposal into a KEP to move it forward.

@k8s-ci-robot k8s-ci-robot reopened this Nov 21, 2018
@k8s-ci-robot
Contributor

@idealhack: Reopened this PR.

In response to this:

/reopen

I'm not sure if we need to translate this proposal into a KEP to move it forward.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: smarterclayton

If they are not already assigned, you can assign the PR to them by writing /assign @smarterclayton in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

@dhilipkumars: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-community-verify 9e81de3 link /test pull-community-verify

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@smarterclayton
Contributor

smarterclayton commented Nov 21, 2018 via email

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@furkanmustafa

Dear @fejta-bot can you please reopen?
/reopen
/remove-lifecycle rotten

@k8s-ci-robot
Contributor

@furkanmustafa: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

Dear @fejta-bot can you please reopen?
/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 26, 2019
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/design Categorizes issue or PR as related to design. sig/apps Categorizes an issue or PR as relevant to SIG Apps. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.