Allow configuring kube client timeouts #8864

Closed
cnmcavoy opened this issue Jun 15, 2023 · 9 comments · Fixed by #8917
Labels
kind/feature Categorizes issue or PR as related to a new feature. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@cnmcavoy (Contributor)

What would you like to be added (User Story)?

As a cluster administrator, I want to be able to configure the timeout for the Kubernetes client configuration instead of accepting the hardcoded 10s default.

Detailed Description

The kube client configuration values used by Cluster API should be configurable via CLI flags at startup. The QPS and Burst settings are already controllable by the user, but the timeout is hardcoded to 10s.

This timeout is used in various places, notably in the CAPI drainNode logic, where in larger clusters the eviction API plus latency can sometimes cause the control plane to take more than 10s to begin responding to requests. When the API server fails to respond in time, Cluster API gives up and immediately retries in another reconcile loop, which often increases the load on the control plane further. I would like to be able to configure a longer timeout for these clusters.
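For illustration, a minimal sketch of what this could look like, assuming hypothetical flag names (--kube-client-qps, --kube-client-burst, --kube-client-timeout) rather than the actual Cluster API flags:

```go
// Sketch only: flag names and defaults are assumptions, not the real CAPI flags.
package main

import (
	"time"

	"github.com/spf13/pflag"
	"k8s.io/client-go/rest"
	ctrl "sigs.k8s.io/controller-runtime"
)

var (
	kubeClientQPS     float32
	kubeClientBurst   int
	kubeClientTimeout time.Duration
)

func initFlags(fs *pflag.FlagSet) {
	// QPS and Burst are already user-configurable today.
	fs.Float32Var(&kubeClientQPS, "kube-client-qps", 20, "Maximum QPS to the API server")
	fs.IntVar(&kubeClientBurst, "kube-client-burst", 30, "Maximum burst for throttled requests to the API server")
	// The requested addition: today the equivalent value is hardcoded to 10s.
	fs.DurationVar(&kubeClientTimeout, "kube-client-timeout", 10*time.Second, "HTTP client timeout for requests to the API server")
}

func buildRESTConfig() *rest.Config {
	cfg := ctrl.GetConfigOrDie()
	cfg.QPS = kubeClientQPS
	cfg.Burst = kubeClientBurst
	cfg.Timeout = kubeClientTimeout // rest.Config.Timeout is the per-request HTTP client timeout
	return cfg
}

func main() {
	initFlags(pflag.CommandLine)
	pflag.Parse()
	_ = buildRESTConfig()
}
```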

Anything else you would like to add?

No response

Label(s) to be applied

/kind feature

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 15, 2023
@killianmuldoon (Contributor)

/triage accepted

sounds like a good idea IMO to make this client configurable.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 16, 2023
@fabriziopandini (Member)

I think we should clarify the ask here a little bit.

In CAPI we have two types of clients: the clients reading/writing to the management cluster, and the clients reading/writing to/from the workload clusters.

AFAIK, as of today we allow setting QPS and burst only for the first set of clients, while we do not provide any form of configuration for the clients reading/writing to/from the workload clusters, which are the ones used for drain.

Also, draining a node is a very special case, where we delegate timeout management to the drainer itself and currently this is hardcoded to 20sec (https://github.com/fabriziopandini/cluster-api/blob/c1ef1d779998f3489889aa34be42e33c7c336e03/internal/controllers/machine/machine_controller.go#L603-L625)
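(For reference, a paraphrased sketch of that drainer setup, assuming the k8s.io/kubectl/pkg/drain helper; the field values are illustrative and not a verbatim copy of the linked code.)

```go
package machine

import (
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/kubectl/pkg/drain"
)

// newDrainHelper sketches how the drainer-level timeout is configured: it is the
// overall time budget for the drain/evictions and is a separate knob from the
// HTTP client timeout of the underlying Kubernetes client.
func newDrainHelper(cs kubernetes.Interface) *drain.Helper {
	return &drain.Helper{
		Client:              cs,
		Force:               true,
		IgnoreAllDaemonSets: true,
		DeleteEmptyDirData:  true,
		Timeout:             20 * time.Second, // drainer timeout, currently hardcoded
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}
}
```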

@cnmcavoy (Contributor, Author)

cnmcavoy commented Jun 20, 2023

The ask is about configuring the HTTP client timeout for the workload kubernetes client that drains nodes, as written in the description. However, we should consider configuring the timeout for other workload cluster clients as well.

The kubernetes config created for draining nodes uses the same config as is created for accessing workload clusters, here: https://github.com/fabriziopandini/cluster-api/blob/c1ef1d779998f3489889aa34be42e33c7c336e03/internal/controllers/machine/machine_controller.go#L582

That Kubernetes configuration always has a 10s Go HTTP client timeout configured: https://github.com/fabriziopandini/cluster-api/blob/c1ef1d779998f3489889aa34be42e33c7c336e03/controllers/remote/cluster.go#L52-L64. We see log messages like the following when our clusters take longer than 10s to respond to eviction:

"Drain failed, retry in 20s" err="Get \"https://awscmh3.k8s-api.indeed.tech:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dip-10-163-198-10.us-east-2.compute.internal\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)"

> Also, draining a node is a very special case, where we delegate timeout management to the drainer itself and currently this is hardcoded to 20sec (https://github.com/fabriziopandini/cluster-api/blob/c1ef1d779998f3489889aa34be42e33c7c336e03/internal/controllers/machine/machine_controller.go#L603-L625)

The 20s pod eviction timeout isn't relevant here; the timeout we hit in larger clusters is the 10s timeout waiting for the API server to start responding to the request with headers.

@sbueringer (Member)

> We see log messages like the following when our clusters take longer than 10s to respond to eviction:

So the API server is not able to list the Pods of a Node within 10 seconds. Got it. Now it makes more sense to me; based on the issue description I thought the actual eviction call for the Pod doesn't finish within 10 seconds, which would have surprised me.

Mostly out of curiosity, do you know how long this call would take to respond for those clusters?

Coming back to making the timeout configurable.

First I would suggest that we start using the rest config from the ClusterCacheTracker consistently. Unfortunately we currently have a mix of using the ClusterCacheTracker and RESTConfig directly. This is not great for scalability as RESTConfig requires a call to the apiserver while the ClusterCacheTracker already caches the config. I opened a PR for that: #8894

I think now we have different options to affect the timeout used during drain:

  1. make the timeout of the ClusterCacheTracker configurable
  2. make the timeout used for drain configurable (we can customize the rest config there)

The problem with the first option is that if you increase the timeout used for the entire ClusterCacheTracker (i.e. all clients we use for the workload clusters), reconcilers can get stuck when workload clusters are not reachable.

Let's assume you set the timeout to a minute and then some workload clusters become unreachable. The result is that, either during client creation or in any other server-side call, reconcilers can get stuck for up to this timeout (a minute). The consequence is that some of the workers used by the reconcilers can't be used to reconcile any of the healthy clusters. We have already seen this in production; essentially the entire core CAPI controller can get stuck unless you use a worker count equal to the number of your Machines.

Given this, I think option 2 is probably the better solution for your problem, as with option 1 actually configuring a high timeout comes with severe trade-offs.

@fabriziopandini (Member)

Thanks for the context, it really helped me.
I agree with what @sbueringer said; also, this is somewhat in line with what we are discussing in #8893 (one timeout for clients, different timeouts for specific operations).

The only caveat here is what kind of UX we are thinking of here, given that this is something that could go at the machine level, and how it relates to the existing nodeDrainTimeout.

@cnmcavoy (Contributor, Author)

> Mostly out of curiosity, do you know how long this call would take to respond for those clusters?

10-11s. We are barely hitting the timeout, and only in our largest cluster, with 30-40k pods and 600-800 nodes (the cluster autoscales up and down daily). I can provide more metrics from the controllers if there are specific metrics that are of interest. Usually we are barely missing the timeout; some of the time retrying succeeds, but the fact that retrying has to happen at all slows down the MachineDeployment rollout.

> First I would suggest that we start using the rest config from the ClusterCacheTracker consistently. Unfortunately we currently have a mix of using the ClusterCacheTracker and RESTConfig directly. This is not great for scalability as RESTConfig requires a call to the apiserver while the ClusterCacheTracker already caches the config. I opened a PR for that: #8894

... I see. I was a bit confused by the way the Kubernetes config was used, and this answers part of my confusion.

> The only caveat here is what kind of UX we are thinking of here, given that this is something that could go at the machine level, and how it relates to the existing nodeDrainTimeout.

Good point. #8865 changes the option globally, but really the ask here is only about the machine controller. So, putting together the comments on #8865 and here, would a better approach be to start over and define some sort of DrainOptions struct, which contains the timeout (and, I guess, the other drain tunables) so it can be initialized by flags or by a library consumer of CAPI and passed to the Machine controller? Or is some other approach entirely desired?
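A minimal sketch of that idea; the struct name, fields, and flag name below are hypothetical, not an agreed-upon design:

```go
package machine

import (
	"time"

	"github.com/spf13/pflag"
)

// DrainOptions bundles the drain tunables so they can be set via controller flags
// or directly by a library consumer embedding the Machine controller.
type DrainOptions struct {
	// ClientTimeout is the HTTP client timeout for the workload cluster client used while draining.
	ClientTimeout time.Duration
	// Other drain tunables (grace period, retry behavior, ...) could live here as well.
}

// AddFlags registers the drain-related flags; the flag name is illustrative only.
func (o *DrainOptions) AddFlags(fs *pflag.FlagSet) {
	fs.DurationVar(&o.ClientTimeout, "node-drain-client-timeout", 10*time.Second,
		"HTTP client timeout for workload cluster requests made while draining a node")
}
```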

@sbueringer (Member)

sbueringer commented Jun 22, 2023

Thx for the data, just wanted to get a feeling for the scenario/ the latency.

Okay, good. Sounds like we have consensus on only modifying the drain timeout.

Just FYI, in KCP we already have a case where we hard-code a timeout to 30 sec. So we could consider hard-coding the drain client timeout to 20 or 30 seconds.

If we want to make it configurable we have 2 options: either a command-line flag for the core controller or a field on the Machine CRD.

On one side, the Machine CRD already has 3 timeout fields. On the other side, controllers have the mgmt cluster QPS flags. (The KCP controller also has etcd timeout flags.)

Maybe a flag is the best option if it's not strictly necessary to fine-tune the timeout per Cluster/MD. A CRD field would have to be added to ClusterClass, Cluster.spec.topology, KCP, MD, MS, Machine and MachinePool.

(I prefer the flag option)

If we add it as a flag, I think we don't need a util for this flag, as we would add it only to the core controller. A separate PR to introduce a flag util for the QPS flags would be nice though.

@fabriziopandini (Member)

fabriziopandini commented Jun 26, 2023

I'm ok with the flag

(long version)
QPS flags are for the management cluster client, so they can hardly be linked to a single machine/cluster (ok to have flags here); drain, instead, applies to a single machine, but we already have prior art where we used flags for similar use cases, namely etcd-dial-timeout-duration and etcd-call-timeout-duration, so I think it is ok in this case as well to use a flag to set a default for all the clusters/machines.

I would defer changing the API until we have a clear use case for it, but it is probably worth opening an issue of kind api-changes to track the idea for when we start discussing the next API version.

@sbueringer (Member)

sbueringer commented Aug 21, 2024

@cnmcavoy I'm currently refactoring the drain logic. Part of that is getting rid of the separate kubernetes client we use for drain. Instead I'll use the client we always use when communicating with workload clusters (the client cached in the ClusterCacheTracker). This client currently uses a hard-coded 10s timeout.

So if I just change it that way without further changes, we'll hit a regression regarding this issue.

But I noticed that our Pod list call wasn't using pagination. Do you think a 10s timeout could be enough when listing Pods with pagination (Limit/Continue)?
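For reference, a rough sketch of what a paginated Pod list for a node could look like with client-go; the helper name and page size are illustrative, not the actual refactored code:

```go
package drain

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listPodsOnNode pages through the Pods scheduled to nodeName so that each request
// stays small and is more likely to complete within the 10s client timeout.
func listPodsOnNode(ctx context.Context, cs kubernetes.Interface, nodeName string) ([]corev1.Pod, error) {
	var pods []corev1.Pod
	opts := metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
		Limit:         100, // page size is an arbitrary example value
	}
	for {
		page, err := cs.CoreV1().Pods(metav1.NamespaceAll).List(ctx, opts)
		if err != nil {
			return nil, err
		}
		pods = append(pods, page.Items...)
		if page.Continue == "" {
			return pods, nil
		}
		// Each continued page is a separate HTTP request, so the per-request client
		// timeout applies to each page rather than to the whole list.
		opts.Continue = page.Continue
	}
}
```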
