- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Future work
- Alternatives
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as
implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- (R) Graduation criteria is in place
- (R) Production readiness review completed
- Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
The kube-apiserver watch cache is initialized from etcd at the moment when it starts with empty change history. As a consequence, clients that want to resume a watch immediately after kube-apiserver reboots almost always have a resource version that is out of the history window. In this KEP we propose how to address this problem.
In order to support large Kubernetes clusters without huge number of etcd
and kube-apiserver replicas, kube-apiserver contains an in-memory cache
called watch cache
from which all watch requests as well as non-quorum
reads (in opt-in fashion) are served. Watch cache is being propagated
directly from etcd using typical "list + watch" pattern and stores both
the current state as well as some recent 'transaction log'. There is a
separate watch cache for each resource type version.
Digging into more details, the way how watch cache is propagated is:
- on start, quorum list is send to etcd
- given the RV of the list represents the current "global etcd version", it doesn't have to reflect any change of objects of our type
- the watch cache is set to be synced to list resource version (which as mentioned above isn't necessary reflecting any change of objects of that type)
- from the on, watch to etcd is established and watch cache is being updated by incoming watch event stream; note that only objects of a given resource type are being watched and only those can update the resource version to which watch cache is synced from then on
- if watch cannot be reestablished, all watchers connected to it are disconnected, a new quorum list is send and watchcache is reset to that point in time
It's important to note that while only objects of a given resource type are watched (thus resource version to which watch cache is synced can only be updated to a value of a change (create/update/delete) of some object of that type), the quorum ist, even though it is only requesting objects of that resource type, returns "globally current version for the whole etcd cluster" (objects of different types may be stored in different etcd clusters, and only those that are stored in the same etcd cluster matter).
While the inability to reestablish watch doesn't happen often in practice, the kube-apiserver restarts obviously does (e.g. on upgrades). And this is resulting in two main problems:
-
In case of rolling upgrade of kube-apiserver, when no object of a given resource type is changing (but objects of other types do), all watchers will eventually be forced to relist, causing significant performance and scalability issues for larger clusters. See the example below for clarification.
-
If no object of a given resource type is changing after kube-apiserver startup, watch cache may stuck being synced to resource version not being version of any object of a given type for extended period of time. See kubernetes/kubernetes#91073 for more details.
To illustrate the problem better, imagine the following example:
-
there is a single kube-apiserver that we are going to restart
-
there are only two resource types: X and Y (for simplicity) [Y is needed to have RV changes by non-X objects]
-
there is one object of type X: x with rv=100
-
there is one object of type Y: y with rv=101
-
there is a watch W for objects of type X already synced to rv=100
-
watchcache for both X and Y is up-to-date
-
T1: kube-apiserver being restarted
-
T2: kube-apiserver is initialized (via quorum LIST from etcd) with watchcache for X synced to rv=101 (most recent etcd version)
-
T3: watch W is trying to reconnect, given rv=100 is outside of supported history window (we only support versions from 101 being the moment of watch cache initialization), relist is forced
Note that adding more kube-apiservers doesn't solve the problem. In fact, it is introducing the second problem where watch cache across kube-apiserver instances may be out of sync for extended period of time.
To illustrate this problem imagine the following example:
-
the setup is as above, with the only difference of two kube-apiservers
-
T1: kube-apiserver-1 being restarted
-
T2: kube-apiserver-1 is initialized with watchcache for X synced to rv=101 (most recent etcd version) [as above]
-
T3: object y is being updated to rv=102
-
T4: kube-apiserver-2 is being restarted
-
T5: kube-apiserver-2 is initialized with watchcache for X synced to rv=102 (most recent etcd version)
As long as no object of type X is being touched (created, updated or deleted), watchcache for X will not be updated. So even though they contain the same set of objects, one claims to be synced to rv=101 and the other claims to be synced to rv=102. This may result in resuming watches across api-servers to suffer from "too old resource version" errors in a steady state.
From the first glance it may look like having many kube-apiservers can mitigate this problem. However, if no object of type X is being changed for extended period of time, this doesn't help. We've seen large production clusters with tens of thousands of pods, where no pod was changing for extended periods of time (e.g. tens of minutes).
The goal of this proposal is to avoid both of these problems.
- avoid tons of relists during kube-apiservers rolling upgrades
- avoid different instances of kube-apiserver stuck with watchcache synced to different resource versions for extended period of time
- allow consistent reads from kube-apiserver cache (this proposal makes it easier but it's not the goal to solve it)
We propose to utilize the 'progress notify' feature from etcd to solve the problem.
Since version 3.0, etcd watch can be configured with WithProgressNotify
enabled. In that case, every N (hard-coded to 10 minutes in the code) etcd
checks if any event was send to the watcher within that interval and if not
sends a special progress notify containing the current etcd resource version
is send to the watcher.
We are going to utilize this feature to solve the problems described above.
-
We will work with etcd team to make the interval configurable. It should be fairly simple - POC
-
We modify all watches used to propagate watchcache to set the
WithProgressNotify
option. For watches being served from etcd in case of disabled watchcache, this should remain unchanged. We can also consider automatically translating to bookmarks if the overhead won't be too large. -
We will modify the kubernetes, so that it can understand progress notify events and use them to update the so-far resource version. This is fairly simple given the already existing support for watch Bookmark events. The only change that will be exposed in the client-go libraries will be the change to reflector so that it can update underlying store resource version based on incoming Bookmark event. Basically instead of current:
...
case watch.Bookmark:
// A `Bookmark` means watch has synced here, just update the resourceVersion
default:
...
we will modify it to:
...
case watch.Bookmark:
// A `Bookmark` means watch has synced here, just update the resourceVersion
+ if rvu, ok := r.store.(resourceVersionUpdater); ok {
+ rvu.UpdateResourceVersion(newResourceVersion)
+ }
default:
...
where resourceVersionUpdate is a simple interface implementing just
UpdateResourceVersion(resourceVersion string)
function.
-
Change watch cache to utilize the resource version updates from Bookmark events.
-
We will set the progress notify period to reasonably small value. The requirement is to ensure that in case of rolling upgrade of multiple kube-apiservers, the next-to-be-updated one will get either a real event or a progress notify one with a version at least as fresh as the version to which the just-upgraded one was initialized (based on that version it can then send bookmarks to its watchers on shutdown). Given we're going to send bookmarks on shutdown, we can wait for some short period until progress notify event come, so values of [1s, 10s] seem reasonable (we can also set 1s frequency and wait up to 10s for the delivery on shutdown to tolerate delays). In the past, it was successfully scale tested up until 250ms - see kubernetes/kubernetes#86769 (comment) so performance/scalability shouldn't be a problem.
Note that if due to some races/issues single ProgressNotify event won't be delievered to a subset of kube-apiservers, this is not a problem, because:
- generally subsequent watch will be send to the same kube-apiserver due to http/http2 connection stickiness in Golang library (unless there is no disruption, which shouldn't be frequent situation)
- even if it wouldn't be the case, that can only cause issues on watchcache initialization; once all of them are initialized then (a) if watch is broken in newer kube-apiserver and reconnects to the older one it is fine because watch can be started with future resource version (b) if watch is broken on older kube-apiserver and reconnects to the newer one, we won't delete change history on ProgressNotify, so that older version should still be stored there (unless there is heavy churn of those objects which is the case that doesn't suffer from this problem)
The POC PR can be found in: kubernetes/kubernetes#92472
The biggest risk are bugs in the implementation. To mitigate this, the
implementation will be hidden behind EfficientWatchResumption
feature
gate and necessary tests will be added and/or extended (details below).
- unit tests for logic enhancing resource version tracking in reflector
- unit tests for newly added watch cache logic
- integration test for sending bookmark on kube-apiserver shutdown
- integration test for proving that resource version that kube-apiserver can serve from cache progresses eventually when objects of other types are being added/updated/deleted; this test should store events (or other type) in a separate etcd cluster (to test split-etcd backend mode) and ensure no RV leak across etcd clusters
Alpha should provide basic functionality covered with tests described above.
- Appropriate metrics are agreed on and implemented
- Ad-hoc manual rolling-upgrade of kube-apiservers in 5k-node cluster is not resulting in required re-listing for watched resources from node components
- Enabled in Beta for at least two releases without complaints
- Benefits of the feature during rolling upgrade of kube-apiserver at scale (thousands of nodes) proven in production
Kubernetes can be safely updated/downgraded, as the implementation is purely in memory:
- if etcd doesn't support frequent enough progress notify events, we won't get expected benefits (problems may not be addressed), but also no unexpected consequences
- enabling the feature may only result in additional watch bookmark events for clients, which they are explicitly opting-in anyway
- disabling the feature reverts the behavior of watchcache being synced to values of objects of different types; however given the initialization is happening at "now" anyway, the time won't go back
n/a - watch bookmarks don't have any frequency guarantees
This section must be completed when targeting alpha to a release.
-
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in
kep.yaml
)- Feature gate name: EfficientWatchResumption
- Components depending on the feature gate: kube-apiserver
- Feature gate (also fill in values in
-
Does enabling the feature change any default behavior? No.
-
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? Yes, watchcache (and watch bookmark events) will not be propagated with resource versions of objects of other types.
-
What happens if we reenable the feature if it was previously rolled back? The expected behavior will be restored.
-
Are there any tests for feature enablement/disablement? No.
This section must be completed when targeting beta graduation to a release.
-
How can a rollout fail? Can it impact already running workloads? In case of bugs, etcd progress notify events may be incorrectly parsed leading to kube-apiserver crashes. It can't affect running workloads.
-
What specific metrics should inform a rollback? Crashes of kube-apiserver.
-
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? Manual tests were run in GKE Regional Cluster with 3 master nodes. No problems were identified during those tests (which no state is stored in etcd associated with this feature).
Additionally, an upgrade path was also tested in GKE Regional Cluster with 5000 nodes with the feature enabled. There is a tremendous positive impact on API call latency metrics:
- Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? No
This section must be completed when targeting beta graduation to a release.
-
How can an operator determine if the feature is in use by workloads? It's not a workload feature.
-
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
- Metric name: etcd_bookmark_counts
- Components exposing the metric: kube-apiserver
- Metrics
-
What are the reasonable SLOs (Service Level Objectives) for the above SLIs? n/a [Bookmark and watch progress notify events are best effort in their nature]
-
Are there any missing metrics that would be useful to have to improve observability of this feature? No
This section must be completed when targeting beta graduation to a release.
-
Does this feature depend on any specific services running in the cluster?
- etcd
- Usage description: We rely on etcd support for ProgressNotify events, that
was added in release 3.3. However, we also rely on ability to configure
notifications period (default of 10m is too high), that was added in 3.5
and backported to 3.4.11.
- Impact of its outage on the feature: etcd outage will translate to cluster outage anyway
- Impact of its degraded performance or high-error rates on the feature: ProgressNotify events may not be send as expected
- Usage description: We rely on etcd support for ProgressNotify events, that
was added in release 3.3. However, we also rely on ability to configure
notifications period (default of 10m is too high), that was added in 3.5
and backported to 3.4.11.
- etcd
-
Will enabling / using this feature result in any new API calls? No. Although new events are being send via etcd to kube-apiserver as part of the open Watch requests.
-
Will enabling / using this feature result in introducing new API types? No
-
Will enabling / using this feature result in any new calls to the cloud provider?
-
Will enabling / using this feature result in increasing size or count of the existing API objects? No
-
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? No
-
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? No
The Troubleshooting section currently serves the Playbook
role. We may consider
splitting it into a dedicated Playbook
document (potentially with some monitoring
details). For now, we leave it here.
This section must be completed when targeting beta graduation to a release.
-
How does this feature react if the API server and/or etcd is unavailable? The feature will not work (though it is a control-plane feature, not a workload one.
-
What are other known failure modes? n/a
-
What steps should be taken if SLOs are not being met to determine the problem? n/a
2020-06-30: KEP Proposed. 2020-08-04: KEP marked as implementable. v1.20: Feature graduated to Alpha 2020-01-15: KEP updated to target Beta in v1.21
n/a
The above solution doesn't address the extensive relisting case in the setup with single kube-apiserver. The reason for that is that we don't send Kubernetes Bookmark events on kube-apiserver shutdown (which would actually be beneficial on its own). However, doing that properly together with ensuring that no request weren't dropped in the meantime (even in single kube-apiserver) scenario isn't trivial and probably deserves its own KEP. As a result, we're leving this as a future work.
The main alternative to the above solution would be to change the way how watch cache is initialized to not only initialize the state but also a transaction log (i.e. read the whole etcd history and initialize transaction log based on it).
Pros:
- doesn't require any etcd changes
- no overhead when kube-apiserver is running (only initialization being more expensive)
Cons:
- given kube-apiserver is performing compaction (default every 5m), lack of changes of any object of type X within that period would result in inability to initialize transaction log anyway; so that is not universal solution
- minor: etcd API doesn't expose the last compaction revision that we should start syncing from