
SIGTERM should gracefully shutdown Kibana #84452

Closed
rudolf opened this issue Nov 27, 2020 · 10 comments · Fixed by #97223
Labels
project:ResilientSavedObjectMigrations (Reduce Kibana upgrade failures by making saved object migrations more resilient)
Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc)

Comments

@rudolf
Contributor

rudolf commented Nov 27, 2020

To avoid data loss when Kibana is stopped (either to restart a process or upgrade Kibana), we should support a graceful shutdown which stops receiving new work and tries to complete all in-progress work before the process is terminated:

  1. When Kibana receives a SIGTERM signal it starts a graceful shutdown and performs the following steps:
    i. The node starts returning 503 from its healthcheck endpoint to signal to the load balancer that it is no longer accepting new traffic (requires Create health check / readiness endpoint #46984).
    ii. Return 503 from all registered routes (edit: do we want to keep responding from /api/status and the healthcheck endpoints?). This provides a strong guarantee that Kibana won't accept any new work or long-running connections. As a first step we could rely on users correctly configuring their load balancer to no longer send any traffic, but this feels error prone and doesn't help users who don't run Kibana behind a load balancer.
    iii. Allow any existing HTTP requests to complete, with a configurable timeout, before forcefully terminating any open connections (is 30s a good default?) (edit: can be implemented by calling await hapiServer.stop({ timeout: 30 * 1000 })).
    iv. Close any keep-alive sockets by sending a connection: close header. (edit: already taken care of by hapi's server.stop(): https://github.com/hapijs/hapi/blob/master/lib/core.js#L449-L456)
    v. Shut down all plugins and Core services. The best way to do this while preventing data loss is discussed in Change Plugin.stop signature to allow asynchronous stop lifecycle #83612.

From https://github.com/rudolf/kibana/blob/8f10cf22a93bd87a316a13a7965d4f5c8f160738/rfcs/text/0013_saved_object_migrations.md#4213-upgrade-and-rollback-procedure
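Pulling the steps above together, here is a minimal sketch using plain hapi rather than Kibana's actual HttpService; the /healthz route name and the exit wiring are illustrative assumptions, not how Kibana exposes its status endpoint:

```ts
import Hapi from '@hapi/hapi';

const server = Hapi.server({ port: 5601 });
let shuttingDown = false;

// Step i: a healthcheck that flips to 503 once shutdown begins, so the load
// balancer stops routing new traffic to this node.
server.route({
  method: 'GET',
  path: '/healthz', // hypothetical endpoint, not Kibana's real status route
  handler: (_request, h) =>
    shuttingDown ? h.response('shutting down').code(503) : h.response('ok'),
});

async function gracefulShutdown() {
  shuttingDown = true;
  // In practice we'd wait here long enough for the load balancer to observe the 503.
  // Steps iii and iv: hapi waits up to `timeout` ms for in-flight requests and
  // sends `connection: close` on keep-alive sockets before destroying the rest.
  await server.stop({ timeout: 30 * 1000 });
  // Step v (shutting down plugins and Core services) would go here.
}

async function main() {
  await server.start();
  process.on('SIGTERM', () => {
    void gracefulShutdown().finally(() => process.exit(0));
  });
}

void main();
```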

@rudolf rudolf added the Team:Core and project:ResilientSavedObjectMigrations labels Nov 27, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@pgayvallet
Contributor

pgayvallet commented Nov 30, 2020

Allow any existing HTTP requests to complete with a configurable timeout before forcefully terminating any open connections. This timeout should be configurable (is 30s a good default?)

From https://github.com/igorgolovanov/hapi-graceful-shutdown/blob/master/lib/index.js, it seems this can easily be done by providing a timeout parameter when calling server.stop. I also assume this would handle point iv, closing any keep-alive sockets by sending a connection: close header.

However, if we need to support the 'except the /api/status' part, I guess we'll need to wrap all our route handlers to return 503s once the shutdown process has started, and then call server.stop({ timeout: 0 }) after the given timeout to 'manually' manage the pending-request timeout / graceful server stop. I'm not sure how to close the keep-alive connections before the server terminates, though.

Point v, shutting down all plugins and Core services, is the only one that does not depend on underlying libs, so it should be rather easy (and I think it is already mostly, if not entirely, implemented?).

@afharo
Member

afharo commented Nov 30, 2020

I think i. and ii. can easily be implemented with a middleware that is a simple pass-through (calling next during normal operation) and replies with a 503 the moment the SIGTERM signal is received. Maybe registering it as onRequest or onPreAuth?

I also think the connection: close header addition (iv.) can be handled by an onPreResponse middleware.
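A minimal sketch of that idea against hapi's extension points directly (Kibana's HTTP service wraps hapi, so the real hooks would look different); the shuttingDown flag and the 30s timeout are assumptions:

```ts
import Hapi from '@hapi/hapi';

const server = Hapi.server({ port: 5601 });
let shuttingDown = false;

// i./ii.: pass requests through during normal operation, short-circuit with a
// 503 as soon as the shutdown flag is set.
server.ext('onRequest', (_request, h) => {
  if (shuttingDown) {
    return h.response('Kibana is shutting down').code(503).takeover();
  }
  return h.continue;
});

// iv.: ask clients to drop keep-alive connections while the server drains.
server.ext('onPreResponse', (request, h) => {
  const response = request.response;
  if (shuttingDown && !(response instanceof Error)) {
    response.header('connection', 'close');
  }
  return h.continue;
});

process.on('SIGTERM', () => {
  shuttingDown = true;
  void server.stop({ timeout: 30 * 1000 }).finally(() => process.exit(0));
});
```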

@rudolf
Contributor Author

rudolf commented Nov 30, 2020

Return 503 from all registered routes except the /api/status and the healthcheck endpoints.

I guess this is not strictly necessary. A graceful shutdown should only last a short time, so users don't really need to query the status API to see what's happening, and the load balancer will remove the node anyway, so in most deployments it won't even be possible to reach the endpoint.

Just checked: hapi's server.stop will close any keep-alive sockets: https://github.com/hapijs/hapi/blob/master/lib/core.js#L449-L456

@pgayvallet
Contributor

I guess this is not strictly necessary

That would greatly facilitate the implementation, as we could then just let hapi handle it entirely.

@afharo
Member

afharo commented Apr 14, 2021

I've been testing it locally and I've got some reassuring data:

  1. Dev mode deals with SIGTERM and SIGINT differently than node scripts/kibana (which I guess is closer to the behaviour in production).
  2. node scripts/kibana already applies hapi's default graceful shutdown timeout (5s).
  3. I've tested it with an API taking 4s to respond, then pressed Ctrl+C on Kibana's process.

The observed behaviour:

  • Hapi does wait for up to 5s to resolve the pending requests.
  • Those requests are resolved with connection: close.
  • It closes the connection of any pending requests that take longer than the timeout to resolve.

There are some gotchas though:

  1. What is the best default timeout? Are the current 5s good enough?
  2. New incoming requests are simply rejected (it does not respond 503, it simply does not accept any new incoming connections). Is this good enough?
  3. We need to make sure we stop services in the right order. At the moment we close the SavedObjects and Elasticsearch services before the HTTP service, which means that if pending HTTP requests depend on those services to complete, their clients will already be closed and the requests will fail.
  4. Some logs are printed after the process already looks finished in the terminal:
    [2021-04-14T17:36:58.906+02:00][DEBUG][http.server.Kibana] registering route handler for [/status]
    [2021-04-14T17:36:58.923+02:00][INFO ][http.server.Kibana] http server running at http://localhost:5601
      log   [17:36:58.943] [info][plugins][watcher] Enabling Watcher plugin.
      log   [17:36:58.968] [info][kibana-monitoring][monitoring][monitoring][plugins] Starting monitoring stats collection
      log   [17:36:58.991] [info][plugins][securitySolution] Dependent plugin setup complete - Starting ManifestTask
    [2021-04-14T17:37:01.962+02:00][DEBUG][http.server.Kibana.cookie-session-storage] Error: Unauthorized
    [2021-04-14T17:37:01.969+02:00][DEBUG][http.server.Kibana.cookie-session-storage] Error: Unauthorized
    ^C  log   [17:37:04.907] [info][plugins-system] Stopping all plugins.
      log   [17:37:04.909] [info][kibana-monitoring][monitoring][monitoring][plugins] Monitoring stats collection is stopped
    
    [2021-04-14T17:37:04.970+02:00][DEBUG][http.server.Kibana] stopping http server
    ➜  kibana git:(master) ✗ [2021-04-14T17:37:07.676+02:00][DEBUG][http.server.response] POST /api/telemetry/v2/clusters/_stats 200 5719ms - 27.7KB
    [2021-04-14T17:37:07.678+02:00][DEBUG][http.server.Kibana] http server stopped
    
    Note the ^C and the ➜  kibana git:(master) ✗ shell prompt appearing in the middle of the log output.

I'll create a PR:

  1. Changing the order to make sure the HTTP service is stopped before any other potential dependencies (see the sketch after this list).
  2. Exposing the graceful_shutdown_timeout config parameter (defaulting to the current 5s?).
  3. Adding some integration tests to prove this behaviour.
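A minimal sketch of the ordering concern in (1); the Stoppable interface and the service names are illustrative, not Kibana's actual Core service classes:

```ts
interface Stoppable {
  stop(): Promise<void>;
}

// Stop the HTTP server first so that in-flight route handlers can still use the
// Elasticsearch and saved-objects clients, and only close those clients once no
// new requests can arrive.
async function stopInOrder(
  http: Stoppable,
  savedObjects: Stoppable,
  elasticsearch: Stoppable
): Promise<void> {
  await http.stop();          // drains traffic within the graceful timeout
  await savedObjects.stop();  // safe: no handler can issue new saved-object calls
  await elasticsearch.stop(); // safe: no handler can issue new ES requests
}
```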

Regarding the default timeout value, there is no commonly agreed value across different platforms:

  1. Systemd has a default of 90s
  2. Kubernetes' default is 30s
  3. Docker waits up to 10s by default

What do you think?

@mshustov
Contributor

Regarding the default timeout value, there is no commonly agreed value across different platforms:

Maybe up to 30s; there might be some long-running requests to ES?

New incoming requests are simply rejected (it does not respond 503, it simply does not accept any new incoming connections). Is this good enough?

good to start.

Some logs are printed when the process looks like finished in the terminal:

Is this in dev mode only?

@afharo
Member

afharo commented Apr 14, 2021

maybe up to 30sec, there might be some long-running requests to ES?

Happy to use that default. Are we OK with Docker killing the process earlier by default?

Some logs are printed when the process looks like finished in the terminal:

Is it in the dev mode only?

Nope, this was running node scripts/kibana. I think it's related to the fact that process.on listeners have to be synchronous 🤔
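A minimal sketch of that hypothesis: the signal listener itself returns synchronously and can only kick off the async teardown, so the event loop keeps running (and logging) until that promise settles; stopEverything is an assumed placeholder, not a real Kibana function:

```ts
// Placeholder for the real async teardown (server.stop, stopping plugins, ...).
async function stopEverything(): Promise<void> {
  // ...
}

process.on('SIGINT', () => {
  // The listener returns immediately; the teardown (and its log output)
  // continues on the event loop until the promise settles and we exit.
  void stopEverything().finally(() => process.exit(0));
});
```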

@afharo
Member

afharo commented Apr 20, 2021

Reopening to run the necessary follow-ups:

  • Document the new config parameter server.shutdownTimeout and the behaviour ([DOCS] server.shutdownTimeout #97678)
  • Double-check with Cloud and Ops teams if any follow-ups are required.

@afharo
Member

afharo commented Apr 20, 2021

  • Double-check with Cloud and Ops teams if any follow-ups are required.

@afharo afharo closed this as completed Apr 20, 2021