Differentiate liveness and readiness probes for router pods #19009

Merged · 2 commits · May 22, 2018

Conversation

@JacobTanenbaum (Contributor) commented Mar 16, 2018

Create an upstream commit that allows multiple groups of checks to be associated with health checking. Using the multiple groupings, differentiate the liveness and readiness probes for the HAProxy router.

Bug 1550007

@openshift-ci-robot added the size/S label (denotes a PR that changes 10-29 lines, ignoring generated files) Mar 16, 2018
@knobunc self-assigned this Mar 16, 2018
@knobunc (Contributor) commented Mar 16, 2018

@smarterclayton is this a reasonable way to approach the problem?

@knobunc (Contributor) commented Mar 16, 2018

@openshift/networking PTAL

@knobunc added the kind/bug (categorizes issue or PR as related to a bug) and component/routing labels Mar 16, 2018
@rajatchopra (Contributor) left a comment:

Looks good to me.
But this just separates the URLs; the logic served on each needs to be differentiated too. So is there going to be a part 2? Or am I missing something?

@@ -436,7 +436,7 @@ func generateSecretsConfig(cfg *RouterConfig, namespace string, defaultCert []by
 	return secrets, volumes, mounts, nil
 }

-func generateProbeConfigForRouter(cfg *RouterConfig, ports []kapi.ContainerPort) *kapi.Probe {
+func generateProbeConfigForRouter(Path string, cfg *RouterConfig, ports []kapi.ContainerPort) *kapi.Probe {
Review comment (Contributor): camelCase for 'Path'

@pravisankar left a comment:

As Rajat pointed out, need to camelCase the variable. Otherwise, LGTM.

@pravisankar commented:

@rajatchopra I don't think there will be a part 2 here.
https://github.com/openshift/origin/pull/19009/files#diff-43ed12581474d0d4e9bd5b12bdcf6417R40
creates an additional '/livez' endpoint (e.g. http://localhost:1936/livez) which always returns success; this is used for the liveness probe. Success from this endpoint means the router is up, but it is not ready until /healthz returns success.
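
For illustration, here is a minimal, self-contained sketch of such an always-successful liveness endpoint (a hypothetical standalone program, not the PR's code; the PR adds the equivalent mux.Handle call inside the router's Listener, as shown in a diff further down):

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	mux := http.NewServeMux()
	// /livez performs no checks at all: if the process can answer HTTP,
	// it is considered live. Readiness-style checks stay on /healthz.
	mux.Handle("/livez", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok")
	}))
	// 1936 is the router stats port mentioned above.
	http.ListenAndServe("localhost:1936", mux)
}
```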

@JacobTanenbaum force-pushed the BZ1550007 branch 2 times, most recently from fefb00b to 1f93335, March 19, 2018 14:22
@@ -447,7 +447,7 @@ func generateProbeConfigForRouter(cfg *RouterConfig, ports []kapi.ContainerPort)
 	}

 	probe.Handler.HTTPGet = &kapi.HTTPGetAction{
-		Path: "/healthz",
+		Path: path,
 		Port: intstr.IntOrString{
 			Type:   intstr.Int,
 			IntVal: int32(healthzPort),
Review comment (Contributor): Even though a path is being passed (which is either livez or healthz), the port is always healthz?

 if cfg.StatsPort > 0 {
-	healthzPort = cfg.StatsPort
+	probePort = cfg.StatsPort
@JacobTanenbaum (Contributor, Author) replied:

@rajatchopra @pravisankar Does this make it clearer? Removing the reference to healthzPort in favour of a more generic term.

Review comment (Contributor): +1

@pravisankar left a comment:

/lgtm

Review comment (Contributor) on the same hunk: +1

@openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) Mar 19, 2018
@@ -466,15 +465,15 @@ func generateProbeConfigForRouter(cfg *RouterConfig, ports []kapi.ContainerPort)
 }

 func generateLivenessProbeConfig(cfg *RouterConfig, ports []kapi.ContainerPort) *kapi.Probe {
-	probe := generateProbeConfigForRouter(cfg, ports)
+	probe := generateProbeConfigForRouter("/livez", cfg, ports)
Review comment (Contributor): This should be /healthz/ready, which is our standard readiness check path.

Review comment (Contributor): /healthz is liveness, /healthz/ready is readiness.

@@ -37,6 +37,10 @@ type Listener struct {
 func (l Listener) handler() http.Handler {
 	mux := http.NewServeMux()
 	healthz.InstallHandler(mux, l.Checks...)
+	mux.Handle("/livez", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
Review comment (Contributor): What are you actually trying to check for?

@smarterclayton (Contributor) commented:

This should be following our standard conventions, but you cannot change the meaning of the existing health endpoint without breaking backwards compatibility.

Can you describe what the goal is here - have the endpoint available for load balancers prior to the router actually being loaded? Generally if you want that you just use the service field spec.publishNotReadyAddresses, which doesn't need any of these changes.

@knobunc (Contributor) commented Mar 20, 2018

@smarterclayton the router added ROUTER_BIND_PORTS_AFTER_SYNC so that external load balancers could run useful health checks on port 80 or 443 to know when the router was online. Before we added ROUTER_BIND_PORTS_AFTER_SYNC, the router would bind those ports immediately and potentially serve HTTP 503 statuses for valid routes because the routes had not been loaded yet. So we added ROUTER_BIND_PORTS_AFTER_SYNC so that it would not bind 80/443 until a full sync had happened, BUT would bind 1936 so that the liveness probes worked (otherwise the router gets killed).

After the refactor to make the router controller return status directly, if you set ROUTER_BIND_PORTS_AFTER_SYNC you start failing liveness probes. @JacobTanenbaum found that this was because the liveness check done by the router controller contacts the haproxies it controls to see if they are up and, if not, returns false. So when ROUTER_BIND_PORTS_AFTER_SYNC is set, haproxy doesn't bind 80 and 443, the delegated liveness check fails, and the pod gets terminated.

So, the goal here is to have an endpoint indicating that the router controller is live, but that it is not ready yet... but we have to work with what we have at the moment:

  • /healthz -- Returns only when the backends are ready (changed in 3.7)
  • /healthz/backend-http -- Returns only when haproxy is up (added in 3.7)

Before 3.7 /healthz was a liveness check

Can we call this a bugfix, add /healthz/ready, and move the multiplexer that is on /healthz there? (Should we also support /healthz/ready/backend-http, since that's what the multiplexer will set up?)

Then /healthz can go back to returning true as soon as the router controller becomes active.

@smarterclayton (Contributor) commented Mar 20, 2018 via email

@knobunc (Contributor) commented Mar 21, 2018

@smarterclayton The problem is that the external load balancers are looking at 80 and 443 directly to decide whether to put it in rotation. BUT the liveness check defined for the pod is on /healthz and that is checking that the haproxy is live too before returning 200. So when we don't bind to 80 the liveness probes fail. What would you suggest we do to make the liveness check work?

I see two options for this:

  1. Document that this option is deprecated and only works when the old stats are in use, then create and document proper readiness and liveness checks
  2. Work out some way to make the option work with the current router so we can provide a liveness check

@openshift-ci-robot removed the lgtm label (indicates that a PR is ready to be merged) Mar 22, 2018
@smarterclayton (Contributor) commented Mar 22, 2018 via email

@openshift-ci-robot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) and removed the size/S label May 11, 2018
@@ -66,16 +66,25 @@ func NamedCheck(name string, check func(r *http.Request) error) HealthzChecker {
 // exactly one call to InstallHandler. Calling InstallHandler more
 // than once for the same mux will result in a panic.
 func InstallHandler(mux mux, checks ...HealthzChecker) {
+	InstallPathHandler(mux, "/healthz", checks...)

Review comment (Contributor): Do we need to change vendor code? Can't we do this stuff in openshift code?

Review comment (Contributor): AFAIK changing vendor code needs an 'UPSTREAM:' commit in this case.

@JacobTanenbaum (Contributor, Author) replied:

@pravisankar I did post an upstream PR to accompany this: kubernetes/kubernetes#63716. The commit that includes the kubernetes code is tagged UPSTREAM: 63716; should I add this tag to the PR title?

No, as far as we can tell there is no way to do this in openshift code alone. We use InstallHandler for the checks, and currently you can only have one set of checks, all of which must pass, for both liveness and readiness. The change in vendor allows us to create two sets of checks.
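
To illustrate what the vendor change buys us, here is a rough sketch of wiring two independent check groups (assuming the healthz package API shown in the diff above; installRouterChecks and the two check functions are hypothetical names, and healthz.PingHealthz is the upstream always-ok checker):

```go
import (
	"net/http"

	"k8s.io/apiserver/pkg/server/healthz" // vendored import path approximate for this era
)

// installRouterChecks is a hypothetical helper: with InstallPathHandler each
// path gets its own group of checks, so liveness and readiness can differ.
func installRouterChecks(mux *http.ServeMux, hasSynced, backendUp func(r *http.Request) error) {
	// Readiness: every check in this group must pass before /healthz returns 200.
	healthz.InstallPathHandler(mux, "/healthz",
		healthz.NamedCheck("has-synced", hasSynced),   // hypothetical check funcs
		healthz.NamedCheck("backend-http", backendUp), // supplied by the caller
	)
	// Liveness: a separate group that passes as soon as the process serves HTTP.
	healthz.InstallPathHandler(mux, "/livez", healthz.PingHealthz)
}
```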

Review comment (Contributor): A separate UPSTREAM: 63716 commit is good; no need to add the upstream tag to the PR title.

@knobunc (Contributor) commented May 14, 2018

@JacobTanenbaum Can you fix your description to capture the current behavior, please?

@openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) May 14, 2018
@knobunc left a comment:

/lgtm

@knobunc (Contributor) commented May 14, 2018

/approve

@knobunc (Contributor) commented May 15, 2018

/assign @deads2k

@knobunc (Contributor) commented May 15, 2018

/retest

@pravisankar left a comment:

LGTM


@pravisankar commented:

/retest

@knobunc (Contributor) commented May 18, 2018

@deads2k can you approve the upstream commit to vendor/k8s.io/kubernetes/staging/src/k8s.io/apiserver please? (kubernetes/kubernetes#63716 landed upstream.)

Thanks

@JacobTanenbaum (Contributor, Author) commented:


/retest

@knobunc (Contributor) commented May 21, 2018

@deads2k -- can you approve this since the upstream PR has merged.

…e path to be associated with health checking.

Currently it is only possible to have one group of checks, which must all pass for the handler to report success.
Allowing multiple paths for these checks lets the same machinery be used for other kinds of checks, i.e. readiness.

This upstream change allows for the differentiation of health and readiness checks.
Add a backend to the router controller, "/livez", that always returns true. This differentiates the liveness and
readiness probes so that a router can be alive but not ready.

Bug 1550007
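
As a sketch of the end state this commit message describes, the two probes differ only in path (generateProbes is a hypothetical condensation of the generateProbeConfigForRouter changes above, using the kapi and intstr aliases from those diffs):

```go
// Hypothetical condensation of the merged behavior: both probes hit the same
// stats port, but liveness uses /livez (always succeeds once the controller
// serves HTTP) while readiness keeps /healthz (succeeds only once the
// backends are ready).
func generateProbes(probePort int32) (liveness, readiness *kapi.Probe) {
	mk := func(path string) *kapi.Probe {
		return &kapi.Probe{
			Handler: kapi.Handler{
				HTTPGet: &kapi.HTTPGetAction{
					Path: path,
					Port: intstr.IntOrString{Type: intstr.Int, IntVal: probePort},
				},
			},
		}
	}
	return mk("/livez"), mk("/healthz")
}
```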
@openshift-ci-robot removed the lgtm label (indicates that a PR is ready to be merged) May 21, 2018
@deads2k (Contributor) commented May 21, 2018

/approve

@openshift-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) May 21, 2018
@knobunc (Contributor) commented May 22, 2018

/lgtm

@openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) May 22, 2018
@openshift-ci-robot commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, JacobTanenbaum, knobunc, pravisankar, ramr


@openshift-merge-robot merged commit 13d42ec into openshift:master May 22, 2018
@@ -35,6 +36,28 @@ func HTTPBackendAvailable(u *url.URL) healthz.HealthzChecker {
 	})
 }

+// HasSynced returns a healthz check that verifies the router has been synced at least
+// once.
+func HasSynced(router **templateplugin.TemplatePlugin) healthz.HealthzChecker {
Review comment (Contributor): This sort of construct is a violation of our style guides. You should be passing an interface here that exposes the correct check method. Double pointers should never be used. Nothing in metrics should be aware of template plugin at all.

func HasSynced(router **templateplugin.TemplatePlugin) healthz.HealthzChecker {
	return healthz.NamedCheck("has-synced", func(r *http.Request) error {
		if router != nil {
			if (*router).Router.SyncedAtLeastOnce() == true {
Review comment (Contributor): This construct is not appropriate. It should always be if booleancondition {

		if router != nil {
			if (*router).Router.SyncedAtLeastOnce() == true {
				return nil
			} else {
Review comment (Contributor): When you return early, elide the else, as per our style guide.

			if (*router).Router.SyncedAtLeastOnce() == true {
				return nil
			} else {
				return fmt.Errorf("Router not synced")
Review comment (Contributor): Errors should always be lower case, as per the style guide.
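
Taken together, those review comments suggest a rewrite along these lines (a sketch only, not code from this PR; the narrow interface name is hypothetical, and the import path is approximate for this era):

```go
package metrics

import (
	"fmt"
	"net/http"

	"k8s.io/apiserver/pkg/server/healthz"
)

// syncedRouter exposes only the method the check needs, so this package no
// longer depends on the template plugin type (and no double pointer is needed).
type syncedRouter interface {
	SyncedAtLeastOnce() bool
}

// HasSynced returns a healthz check that verifies the router has synced at
// least once.
func HasSynced(router syncedRouter) healthz.HealthzChecker {
	return healthz.NamedCheck("has-synced", func(r *http.Request) error {
		if router == nil {
			return fmt.Errorf("router not initialized")
		}
		if router.SyncedAtLeastOnce() { // no "== true" comparison
			return nil
		}
		return fmt.Errorf("router not synced") // early return, no else; lower-case error
	})
}
```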
