
SIGTERM should gracefully shutdown Kibana #84452

Closed
rudolf opened this issue Nov 27, 2020 · 10 comments · Fixed by #97223
Labels
project:ResilientSavedObjectMigrations (Reduce Kibana upgrade failures by making saved object migrations more resilient)
Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc)

Comments

@rudolf
Contributor

rudolf commented Nov 27, 2020

To avoid data loss when Kibana is stopped (either to restart a process or upgrade Kibana), we should support a graceful shutdown which stops receiving new work and tries to complete all in-progress work before the process is terminated:

  1. When Kibana receives a SIGTERM signal it starts a graceful shutdown and performs the following steps:
    i. The node starts returning 503 from its healthcheck endpoint to signal to the load balancer that it is no longer accepting new traffic (requires Create health check / readiness endpoint #46984).
    ii. Return 503 from all registered routes (edit: do we want to keep responding from /api/status and the healthcheck endpoints?). This provides a strong guarantee that Kibana won't accept any new work or long-running connections. As a first step we could rely on users correctly configuring their load balancer to no longer send any traffic, but this feels error prone and doesn't help users who don't run Kibana behind a load balancer.
    iii. Allow any existing HTTP requests to complete, with a configurable timeout, before forcefully terminating any open connections (is 30s a good default?) (edit: can be implemented by calling await hapiServer.stop({ timeout: 30 * 1000 })).
    iv. Close any keep-alive sockets by sending a connection: close header. (edit: already taken care of by hapi's server.stop(): https://github.com/hapijs/hapi/blob/master/lib/core.js#L449-L456)
    v. Shut down all plugins and Core services. The best way to do this while preventing data loss is discussed in Change Plugin.stop signature to allow asynchronous stop lifecycle #83612.

From https://github.com/rudolf/kibana/blob/8f10cf22a93bd87a316a13a7965d4f5c8f160738/rfcs/text/0013_saved_object_migrations.md#4213-upgrade-and-rollback-procedure
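Pulling the steps above together, here is a minimal sketch using plain hapi rather than Kibana's actual HttpService; the /healthz route name and the exit wiring are illustrative assumptions, not how Kibana exposes its status endpoint:

```ts
import Hapi from '@hapi/hapi';

const server = Hapi.server({ port: 5601 });
let shuttingDown = false;

// Step i: a healthcheck that flips to 503 once shutdown begins, so the load
// balancer stops routing new traffic to this node.
server.route({
  method: 'GET',
  path: '/healthz', // hypothetical endpoint, not Kibana's real status route
  handler: (_request, h) =>
    shuttingDown ? h.response('shutting down').code(503) : h.response('ok'),
});

async function gracefulShutdown() {
  shuttingDown = true;
  // In practice we'd wait here long enough for the load balancer to observe the 503.
  // Steps iii and iv: hapi waits up to `timeout` ms for in-flight requests and
  // sends `connection: close` on keep-alive sockets before destroying the rest.
  await server.stop({ timeout: 30 * 1000 });
  // Step v (shutting down plugins and Core services) would go here.
}

async function main() {
  await server.start();
  process.on('SIGTERM', () => {
    void gracefulShutdown().finally(() => process.exit(0));
  });
}

void main();
```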

@rudolf rudolf added the Team:Core and project:ResilientSavedObjectMigrations labels Nov 27, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@pgayvallet
Contributor

pgayvallet commented Nov 30, 2020

Allow any existing HTTP requests to complete with a configurable timeout before forcefully terminating any open connections. This timeout should be configurable (is 30s a good default?)

From https://github.com/igorgolovanov/hapi-graceful-shutdown/blob/master/lib/index.js, it seems this can easily be done by providing a timeout parameter when calling server.stop. I also assume this would handle point iv, closing any keep-alive sockets by sending a connection: close header.

However, if we need to support the 'except the /api/status' part, I guess we'll need to wrap all our route handlers to return 503s once the shutdown process has started, and then call server.stop({ timeout: 0 }) after the given timeout to 'manually' manage the pending-request timeout / graceful server stop. I'm not sure how to close the keep-alive connections before the server terminates, though.

Point v, shutting down all plugins and Core services, is the only one that does not depend on underlying libs, so it should be rather easy (and I think it is already mostly, if not entirely, implemented?).

@afharo
Member

afharo commented Nov 30, 2020

I think i. and ii. can easily be implemented with a middleware that is a simple pass-through (calling next during normal operation) and replies with a 503 the moment the SIGTERM signal is received. Maybe registering it as onRequest or onPreAuth?

I also think the connection: close header addition (iv.) can be handled by an onPreResponse middleware.
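A minimal sketch of that idea against hapi's extension points directly (Kibana's HTTP service wraps hapi, so the real hooks would look different); the shuttingDown flag and the 30s timeout are assumptions:

```ts
import Hapi from '@hapi/hapi';

const server = Hapi.server({ port: 5601 });
let shuttingDown = false;

// i./ii.: pass requests through during normal operation, short-circuit with a
// 503 as soon as the shutdown flag is set.
server.ext('onRequest', (_request, h) => {
  if (shuttingDown) {
    return h.response('Kibana is shutting down').code(503).takeover();
  }
  return h.continue;
});

// iv.: ask clients to drop keep-alive connections while the server drains.
server.ext('onPreResponse', (request, h) => {
  const response = request.response;
  if (shuttingDown && !(response instanceof Error)) {
    response.header('connection', 'close');
  }
  return h.continue;
});

process.on('SIGTERM', () => {
  shuttingDown = true;
  void server.stop({ timeout: 30 * 1000 }).finally(() => process.exit(0));
});
```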

@rudolf
Contributor Author

rudolf commented Nov 30, 2020

Return 503 from all registered routes except the /api/status and the healthcheck endpoints.

I guess this is not strictly necessary. A graceful shutdown should only last a short time, so users don't really need to query the status API to see what's happening, and the load balancer will remove the node anyway, so in most deployments it won't even be possible to reach the endpoint.

Just checked: hapi's server.stop will close any keep-alive sockets: https://github.com/hapijs/hapi/blob/master/lib/core.js#L449-L456

@pgayvallet
Contributor

I guess this is not strictly necessary

That would greatly facilitate the implementation, as we could then just let hapi handle it entirely.

@afharo
Member

afharo commented Apr 14, 2021

I've been testing it locally and I've got some reassuring data:

  1. Dev mode deals with SIGTERM and SIGINT differently than node scripts/kibana (which I guess is closer to the behaviour in production).
  2. node scripts/kibana already applies hapi's default graceful shutdown timeout (5s).
  3. I've tested it with an API taking 4s to respond, then pressed Ctrl+C on Kibana's process.

The observed behaviour:

  • Hapi does wait for up to 5s to resolve the pending requests.
  • Those requests are resolved with connection: close.
  • It closes the connection of any pending requests that take longer than the timeout to resolve.

There are some gotchas though:

  1. What is the best default timeout? Are the current 5s good enough?
  2. New incoming requests are simply rejected (it does not respond 503, it simply does not accept any new incoming connections). Is this good enough?
  3. We need to make sure we stop services in the right order. At the moment we close the SavedObjects and Elasticsearch services before the HTTP service, which means that if pending HTTP requests depend on those services to complete, their clients will already be closed and the requests will fail.
  4. Some logs are printed after the process already looks finished in the terminal:
    [2021-04-14T17:36:58.906+02:00][DEBUG][http.server.Kibana] registering route handler for [/status]
    [2021-04-14T17:36:58.923+02:00][INFO ][http.server.Kibana] http server running at http://localhost:5601
      log   [17:36:58.943] [info][plugins][watcher] Enabling Watcher plugin.
      log   [17:36:58.968] [info][kibana-monitoring][monitoring][monitoring][plugins] Starting monitoring stats collection
      log   [17:36:58.991] [info][plugins][securitySolution] Dependent plugin setup complete - Starting ManifestTask
    [2021-04-14T17:37:01.962+02:00][DEBUG][http.server.Kibana.cookie-session-storage] Error: Unauthorized
    [2021-04-14T17:37:01.969+02:00][DEBUG][http.server.Kibana.cookie-session-storage] Error: Unauthorized
    ^C  log   [17:37:04.907] [info][plugins-system] Stopping all plugins.
      log   [17:37:04.909] [info][kibana-monitoring][monitoring][monitoring][plugins] Monitoring stats collection is stopped
    
    [2021-04-14T17:37:04.970+02:00][DEBUG][http.server.Kibana] stopping http server
    ➜  kibana git:(master) ✗ [2021-04-14T17:37:07.676+02:00][DEBUG][http.server.response] POST /api/telemetry/v2/clusters/_stats 200 5719ms - 27.7KB
    [2021-04-14T17:37:07.678+02:00][DEBUG][http.server.Kibana] http server stopped
    
    Note the ^C and the ➜  kibana git:(master) ✗ shell prompt appearing in the middle of the log output.

I'll create a PR:

  1. Changing the order to make sure the HTTP service is stopped before any other potential dependencies (see the sketch after this list).
  2. Exposing the graceful_shutdown_timeout config parameter (defaulting to the current 5s?).
  3. Adding some integration tests to prove this behaviour.
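A minimal sketch of the ordering concern in (1); the Stoppable interface and the service names are illustrative, not Kibana's actual Core service classes:

```ts
interface Stoppable {
  stop(): Promise<void>;
}

// Stop the HTTP server first so that in-flight route handlers can still use the
// Elasticsearch and saved-objects clients, and only close those clients once no
// new requests can arrive.
async function stopInOrder(
  http: Stoppable,
  savedObjects: Stoppable,
  elasticsearch: Stoppable
): Promise<void> {
  await http.stop();          // drains traffic within the graceful timeout
  await savedObjects.stop();  // safe: no handler can issue new saved-object calls
  await elasticsearch.stop(); // safe: no handler can issue new ES requests
}
```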

Regarding the default timeout value, there is no commonly agreed value across different platforms:

  1. Systemd has a default of 90s
  2. Kubernetes' default is 30s
  3. Docker waits up to 10s by default

What do you think?

@mshustov
Contributor

Regarding the default timeout value, there is no commonly agreed value across different platforms:

Maybe up to 30s; there might be some long-running requests to ES?

New incoming requests are simply rejected (it does not respond 503, it simply does not accept any new incoming connections). Is this good enough?

good to start.

Some logs are printed when the process looks like finished in the terminal:

Is this in dev mode only?

@afharo
Member

afharo commented Apr 14, 2021

maybe up to 30sec, there might be some long-running requests to ES?

Happy to use that default. Are we OK with Docker killing the process earlier by default?

Some logs are printed when the process looks like finished in the terminal:

Is it in the dev mode only?

Nope, this was running node scripts/kibana. I think it's related to the fact that process.on listeners have to be synchronous 🤔
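A minimal sketch of that hypothesis: the signal listener itself returns synchronously and can only kick off the async teardown, so the event loop keeps running (and logging) until that promise settles; stopEverything is an assumed placeholder, not a real Kibana function:

```ts
// Placeholder for the real async teardown (server.stop, stopping plugins, ...).
async function stopEverything(): Promise<void> {
  // ...
}

process.on('SIGINT', () => {
  // The listener returns immediately; the teardown (and its log output)
  // continues on the event loop until the promise settles and we exit.
  void stopEverything().finally(() => process.exit(0));
});
```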

@afharo
Member

afharo commented Apr 20, 2021

Reopening to run the necessary follow-ups:

  • Document the new config parameter server.shutdownTimeout and the behaviour ([DOCS] server.shutdownTimeout #97678)
  • Double-check with Cloud and Ops teams if any follow-ups are required.

@afharo
Member

afharo commented Apr 20, 2021

  • Double-check with Cloud and Ops teams if any follow-ups are required.

@afharo afharo closed this as completed Apr 20, 2021