Change Plugin.stop signature to allow asynchronous stop lifecycle #83612

Closed
pgayvallet opened this issue Nov 18, 2020 · 9 comments · Fixed by #170225
Labels
discuss enhancement New value added to drive a business result Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments

@pgayvallet
Contributor

The initial discussion started because of #80941, where the event_log plugin needs to perform some asynchronous cleanup when Kibana stops.

Our current Plugin interface does not declare stop as possibly asynchronous:

export interface Plugin<
  TSetup = void,
  TStart = void,
  TPluginsSetup extends object = object,
  TPluginsStart extends object = object
> {
  setup(core: CoreSetup, plugins: TPluginsSetup): TSetup | Promise<TSetup>;
  start(core: CoreStart, plugins: TPluginsStart): TStart | Promise<TStart>;
  stop?(): void;
}

Even though, technically, we are already awaiting plugins' stop calls:

while (this.satupPlugins.length > 0) {
  const pluginName = this.satupPlugins.pop()!;
  this.log.debug(`Stopping plugin "${pluginName}"...`);
  await this.plugins.get(pluginName)!.stop();
}
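
Reflecting that in the type would presumably amount to something like the following sketch (illustrative only; the exact typing was only settled later in #170225):

export interface Plugin<
  TSetup = void,
  TStart = void,
  TPluginsSetup extends object = object,
  TPluginsStart extends object = object
> {
  setup(core: CoreSetup, plugins: TPluginsSetup): TSetup | Promise<TSetup>;
  start(core: CoreStart, plugins: TPluginsStart): TStart | Promise<TStart>;
  // Reflects what core already does: a returned promise is awaited during shutdown.
  stop?(): void | Promise<void>;
}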

I also know that we have plans to deprecate asynchronous setup and start methods. However, even once we do that, I have the feeling that allowing an asynchronous stop lifecycle would still make sense: if stop is only synchronous, there is effectively no way to perform async cleanup during this stage (the process will just terminate before the method's async work completes).

So my two questions:

  • Should we 'fix' our Plugin.stop signature to reflect that we currently allow plugins to return promises and that we do wait for them?
  • Do we want to keep this behavior, and if so, do we want to preserve it even when we move to sync-only setup and start?
@pgayvallet pgayvallet added discuss Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc enhancement New value added to drive a business result labels Nov 18, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-platform (Team:Platform)

@pmuellr
Member

pmuellr commented Nov 18, 2020

I have the feeling that allowing asynchronous stop lifecycle would make sense, as if stop is only synchronous, there is effectively no way to perform async cleanups during this stage (this process will just terminate if the method needs to perform an async call)

Agree. In this particular case, stop() wouldn't strictly have to be async with the platform waiting on it - there could be another mechanism used during stop() to provide a promise to wait on before Kibana terminates. But that doesn't feel right, as the current behavior of waiting for stop() to complete means you KNOW that plugin is done, so anything the plugin depended on can be stopped safely.

I'm actually expecting that behavior, but in the opposite way - event log is used by several plugins, and so my hope is that by the time they've stop()d, they won't be writing any events out anymore, and so we'll end up capturing any lingering events they might have generated.

One thing we WILL need to do is put a timeout on the stop() invocations (if we don't already), so that a stop() that never returns can't keep Kibana from shutting down. I don't think we need any special "cancel" protocol to inform a plugin that its timeout is up, but obviously logging that a plugin's stop() got hung would be appropriate.
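
A minimal sketch of what such a timeout wrapper could look like (a hypothetical helper, not core's actual implementation; the 30s figure is purely illustrative):

const DEFAULT_STOP_TIMEOUT_MS = 30_000; // illustrative, not an agreed-upon budget

async function stopWithTimeout(
  pluginName: string,
  stopFn: () => void | Promise<void>,
  timeoutMs = DEFAULT_STOP_TIMEOUT_MS
): Promise<void> {
  let timer: NodeJS.Timeout | undefined;
  const timedOut = new Promise<'timeout'>((resolve) => {
    timer = setTimeout(() => resolve('timeout'), timeoutMs);
  });
  try {
    const result = await Promise.race([
      Promise.resolve().then(stopFn).then(() => 'stopped' as const),
      timedOut,
    ]);
    if (result === 'timeout') {
      // Don't block shutdown forever; just surface that this plugin's stop() hung.
      console.warn(`stop() of plugin "${pluginName}" did not resolve within ${timeoutMs}ms`);
    }
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}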

@pgayvallet
Contributor Author

I'm actually expecting that behavior, but in the opposite way - event log is used by several plugins, and so my hope is that by the time they've stop()d, they won't be writing any events out anymore, and so we'll end up capturing any lingering events they might have generated.

We are stopping (and awaiting) plugins in reverse dependency order, meaning that if pluginA depends on pluginB, pluginA starts AFTER pluginB but is stopped BEFORE it, so you should be safe on that front.

@rudolf
Contributor

rudolf commented Nov 19, 2020

The lifecycles unblocked RFC didn't actually touch on how we want stop to behave.

However, to avoid data loss, we need a way for Kibana to shut down gracefully, e.g. by draining all incoming connection queues or in-progress tasks. I mention this in the saved object migrations RFC, but it isn't totally fleshed out.
https://github.com/rudolf/kibana/blob/8f10cf22a93bd87a316a13a7965d4f5c8f160738/rfcs/text/0013_saved_object_migrations.md#4213-upgrade-and-rollback-procedure

The longer Kibana takes to shut down gracefully, the more downtime a customer has when they're trying to upgrade Kibana. If Kibana is running inside an orchestration layer like Elastic Cloud / Kubernetes, then Kibana refusing to shut down after a SIGINT could cause long downtime until someone manually intervenes. So I think it makes sense to have an upper bound: after a SIGINT, Kibana won't spend more than X minutes trying to stop safely.

If we give each plugin a budget of 30s before we run the next plugin's stop, then it becomes hard to know the maximum time shutdown could take, because this depends on how deep the dependency graph is. I'm also not sure what we would do when a plugin's budget is over; we probably just want to go ahead and stop the other plugins in the dependency graph anyway. This means that, at the end of the day, there's no guarantee that dependent plugins will be truly stopped, so when we call event log's stop it doesn't guarantee that no more work will arrive.

After we receive a SIGINT, core will reject all incoming requests so plugins won't accept new work, but a plugin might still be busy processing work. So if event logs are truly critical, I think the plugin needs to flush its queue, resolve the stop promise, and then enter a non-queue mode where every incoming event is immediately persisted. That way it keeps accepting and writing logs as long as possible. Maybe this is overkill, since a stop timeout is hopefully an edge case?
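
To make that "flush, then write-through" idea concrete, here is a rough sketch (the event log internals below are entirely hypothetical, just to illustrate the shape):

interface EventWriter {
  bulkWrite(events: object[]): Promise<void>; // e.g. a bulk index into ES
}

class EventLogBuffer {
  private buffer: object[] = [];
  private writeThrough = false;

  constructor(private readonly writer: EventWriter) {}

  logEvent(event: object) {
    if (this.writeThrough) {
      // After stop() has resolved: persist immediately, no more batching.
      void this.writer.bulkWrite([event]);
    } else {
      this.buffer.push(event);
    }
  }

  async stop() {
    // Flush whatever is queued, then switch to immediate persistence for any
    // stragglers produced by dependent plugins that haven't stopped yet.
    const pending = this.buffer.splice(0);
    if (pending.length > 0) {
      await this.writer.bulkWrite(pending);
    }
    this.writeThrough = true;
  }
}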

@pgayvallet
Contributor Author

The longer Kibana takes to shutdown gracefully the more downtime a customer has when they're trying to upgrade Kibana.

Is waiting one more minute to stop Kibana really an issue when you know the time it takes to perform a 5k+ objects migration?

then Kibana refusing to shutdown after a SIGINT could cause a long downtime until someone manually intervenes

I can bet that any process unresponsive after a SIGINT in an orchestrated environment will eventually receive a SIGTERM after a given timeout. At least I know that for sure for K8s, or any container-based infra. Having an upper bound / timeout for plugins to stop still makes sense, of course.

If we give each plugin a budget of 30s before we run the next plugin's stop, then it becomes hard to know what's the maximum time it could take

Quite easy actually: Max(stop) = plugins.length * 30s 😄
I doubt all plugins will need to perform cleanup though, or that this threshold will be reached unless something bad is happening (such as connectivity issues with ES).

I'm also not sure what we will do when a plugin's budget is over, we probably just want to go ahead and stop the other plugins in the dependency graph anyway

Seems like the right thing to do, yeah.

@rudolf
Contributor

rudolf commented Nov 30, 2020

Quite easy actually. Max(stop) = plugins.length * 30s 😄
I doubt all plugins need to perform a cleanup though, nor that this threshold will be reached unless something bad is happening (such as connectivity issues with ES)

The important thing is that we provide users with a fixed number which they can use to, for instance, set their Kubernetes terminationGracePeriodSeconds. At the same time, we want to give every plugin a chance to shut down. plugins.length * 30s = 50 minutes, which is unrealistic, but if we tell users to set terminationGracePeriodSeconds any lower, it means some plugins will not be able to shut down. If, for instance, Elasticsearch is down or experiencing load problems, plugins could easily exceed their budget. In such a case there will always be data loss, but we should try to give every plugin a chance to mitigate it. Audit logs might want to print an error saying "the following audit logs could not be persisted..."

In the example of audit logs, it will have to wait for all the plugins that depend on it before its shutdown handler is called, which means it potentially gets a very small piece of the shutdown budget.

One solution could be to introduce a new terminate lifecycle which is synchronous. The Kibana process would terminate on the next tick after this lifecycle method is called. So audit logs could use this method to log any events that it wasn't able to persist. If users really care about these, they can manually insert them into ES. This way we can ensure that there's never data loss (assuming users read their log messages).
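
In interface terms, that proposal would look roughly like this (hypothetical; no such lifecycle exists in Kibana today):

// Sketch of the discussed hooks; `terminate` is the hypothetical addition.
export interface PluginShutdownHooks {
  // Async cleanup, awaited (up to a timeout) in reverse dependency order.
  stop?(): void | Promise<void>;
  // Synchronous last-chance hook: the process would exit on the next tick after
  // calling it, so only synchronous work is possible (e.g. logging unpersisted events).
  terminate?(): void;
}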

@pgayvallet
Contributor Author

pgayvallet commented Nov 30, 2020

The important thing is that we provide users with a fixed number which they can use to for instance set their Kubernetes terminationGracePeriodSeconds. At the same time we want to give every plugin a chance to shutdown. plugins.length * 30s = 50 minutes which is unrealistic, but if we tell users to set terminationGracePeriodSeconds any lower it means some plugins will not be able to shutdown

Might, not will.

That assumes all plugins would fully consume their stop timeout period, which seems very far from realistic:

  • Most plugins will probably not even define an async stop lifecycle (I mean, this is the first time we've gotten a request for this feature, and the KP migration is already done for all plugins, so we can assume this feature is only required for a very small number of them?)
  • The ones that do should not, except in edge cases such as connectivity issues, reach their timeout. I mean, it's a timeout, not a budget.

Honestly, documenting the required grace period as 3 or 4 minutes would probably cover the 99th percentile of our shutdown scenarios (and even that number is enormous, to be honest; give or take, I'd bet 1 minute would cover the 95th percentile). We could periodically monitor that to see if we need to update this documented fixed number.

A terminationGracePeriodSeconds is always an approximation / an x-th percentile, not the maximum time the process could possibly take before terminating normally (and a shutdown scenario where all the plugins hit their stop timeout should not be considered a 'normal' scenario: it would be caused by an external issue, so killing the process in that case would make sense anyway).

Also, as already stated, since the Kibana process can be terminated for a lot of unexpected reasons (a power outage, or more realistically kubectl delete deployment --grace-period=0), all cleanup / finalizing work performed in stop may not run at all anyway, so plugins MUST be resilient to such scenarios (the point is, not all shutdowns are assured to be graceful).

One solution could be to introduce a new terminate lifecycle which is synchronous. The Kibana process will terminate in the next tick after this lifecycle method is called. So audit logs could use this method to log any events that it weren't able to persist. If users really care about these they can manually insert them into ES

We already have a sync stop handler, and audit logging's stop is already assured to be called after any plugin depending on it, so I'd say it would already work as things stand?

@mshustov
Contributor

mshustov commented Jan 6, 2021

To summarise: here are several teardown cases:

  • Kibana stops normally due to an external command. It can shut down gracefully. It's safe to wait for all the plugins to finish their work in the stop lifecycle, with some precautions: stop shouldn't throw, stop shouldn't time out, etc.
  • Kibana stops due to an internal problem - an exception bubbling up to the root, for example. It may not be safe to call stop, as the Core APIs might be in an unusable state. Right now we don't distinguish between this scenario and the "graceful shutdown" one, so in theory stop might throw an exception and interrupt the stop calling chain.
  • The Kibana process is terminated: an unhandled promise rejection, for example. Kibana won't even try to call the stop lifecycle.

From the list above, it doesn't look like the proposed terminate lifecycle covers the 2nd & 3rd cases.
An async stop covers the 1st scenario, but still cannot provide guarantees that stop will be executed in 100% of cases.

@pmuellr
Member

pmuellr commented Jan 14, 2021

I think that's about as good as you can get - I'll take it!

pgayvallet added a commit that referenced this issue Nov 6, 2023
## Summary

Fix #83612

This PR doesn't change any behavior: we already support (and await) promises returned from plugins' `stop` calls; it just changes the type signature to reflect that.

Also removed empty `stop` methods from existing plugins to keep TypeScript happy.

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>