
KEP-4322: ClusterProfile API

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • "Implementation History" section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Currently, there is no standardized approach to defining a cluster inventory. However, with the growing number of users managing multiple clusters and deploying applications across them, projects like Open Cluster Management (OCM), Clusternet, Kubernetes Fleet Manager, or Karmada have emerged. This document introduces a proposal for a new universal ClusterProfile API. The objective is to establish a shared interface for cluster inventory, defining a standard for status reporting while allowing for multiple implementations.

Motivation

A cluster inventory, where users can discover Kubernetes clusters, their properties, and their status, is a common component in almost every major multi-cluster management solution. Yet, at this moment, there is no standard way to access such an inventory. As more and more users embrace the cloud-native approach and deploy workloads across multiple clusters in concert with the help of multi-cluster management solutions, we believe it is critical to devise a common API through which applications, toolsets, and human operators can easily discover clusters under management.

By adopting this new ClusterProfile API, consumers no longer need to concern themselves with the implementation details of various projects. Instead, they can leverage a foundational API for multi-cluster management. Examples of consumers include:

  • Multi-cluster workload schedulers: we’ve seen requirements for distributing applications/workloads to multiple clusters. The scheduling can be based on certain cluster properties, e.g. the cloud the cluster resides in, the resources a cluster provides, or latency to some external endpoints. A common ClusterProfile API will give schedulers a standard way to reason about clusters and help foster the growth of this area.
  • GitOps tools (Argo CD, Flux, etc.) need to deploy workloads to multiple clusters. They either have to build the cluster concept themselves or understand the cluster API of each different cluster management project. A common ClusterProfile API can provide a thin compatibility layer across different projects.
  • Operations tools or custom external consumers: this API gives a common way for different clouds and vendors to define clusters, providing a vendor-agnostic integration point for external tools.
  • Cluster manager implementations themselves, for purposes such as grouping clusters into MCS API clustersets.

Goals

  • Establish a standardized ClusterProfile API to represent clusters.
  • Lay the groundwork for multi-cluster tooling by providing a foundational component.
  • Accommodate multiple implementations to encourage flexibility and adoption.
  • Allow for future extensions and new use cases by leaving room for further development and enhancements. A unified API, such as this one, is most effective when platform extension authors can use it as a foundational tool to create extensions compatible with multiple providers.
  • Allow cluster managers of different types to share a single cluster inventory.

Non-Goals

  • Provide a standard reference implementation.
  • Define specific implementation details beyond general API behavior.
  • Offer functionality related to multi-cluster orchestration.

Proposal

The API proposed by this KEP aims to:

  • Provide a reliable, consistent, and automated approach for any multi-cluster application (framework, toolset) to discover available clusters and take actions accordingly, similar to how service discovery works in a microservice architecture. Through the inventory, the application can query for a list of clusters to access, or watch an ever-flowing stream of cluster lifecycle events which it can act upon in a timely manner, such as auto-scaling, upgrades, failures, and connectivity issues (see the consumer sketch after this list).
  • Provide a simple, clear interface for human operators to understand clusters under management.
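
For illustration only, the sketch below shows how a consumer might query and watch a cluster inventory with client-go's dynamic client. It assumes the group/version/resource from the API example later in this document (multicluster.x-k8s.io/v1alpha1, clusterprofiles) and a hypothetical hub kubeconfig path; error handling is kept minimal.

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Hypothetical kubeconfig pointing at the hub cluster that hosts the inventory.
    cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/hub-kubeconfig")
    if err != nil {
        panic(err)
    }
    client := dynamic.NewForConfigOrDie(cfg)

    // Group/version/resource assumed from the API example later in this KEP.
    gvr := schema.GroupVersionResource{
        Group:    "multicluster.x-k8s.io",
        Version:  "v1alpha1",
        Resource: "clusterprofiles",
    }

    // Query the inventory for the clusters currently under management.
    list, err := client.Resource(gvr).Namespace(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        panic(err)
    }
    for _, cp := range list.Items {
        fmt.Println(cp.GetNamespace(), cp.GetName())
    }

    // Watch the stream of cluster lifecycle events and react to them.
    w, err := client.Resource(gvr).Namespace(metav1.NamespaceAll).Watch(context.TODO(), metav1.ListOptions{})
    if err != nil {
        panic(err)
    }
    for event := range w.ResultChan() {
        fmt.Println(event.Type) // ADDED, MODIFIED, DELETED, ...
    }
}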

Terminology

  • Cluster Inventory: A conceptual term referring to a collection of clusters.

  • Member Cluster: A Kubernetes cluster that is part of a cluster inventory.

  • Cluster Manager: An entity that creates the ClusterProfile API object per member cluster, and keeps their status up-to-date. Each cluster manager MUST be identified with a unique name.
    Each ClusterProfile resource SHOULD be managed by only one cluster manager. A cluster manager SHOULD have sufficient permission to access the member cluster to fetch the information so it can update the status of the ClusterProfile API resource.

  • ClusterProfile API Consumer: The person running the cluster managers, or the person developing extensions for cluster managers for purposes such as workload distribution and operations management.

User Stories (Optional)

Story 1: Multicluster Workload Distribution

In this scenario, a unified API acts as a foundation for multicluster scheduling decisions across various clusters. This means that the API will provide a single point of contact to retrieve information that can be used to make informed scheduling decisions for workloads across multiple clusters. For instance, workload distribution tools like GitOps or Work API can leverage this API to make informed decisions about which cluster is best suited to handle a particular workload.

Examples of multicluster scheduling include:

  • As a user I want to run a workload on an EKS cluster that resides in us-east-1. I want to submit my workload if and only if a cluster satisfies that constraint.
  • As a user I want to run a workload on a cluster that has certain CRDs installed.
  • As a user I want to deploy a workload to my on-prem clusters.
  • As a user I want to run a workload close to the data source or end-users.
  • As a user I want to run a workload on the least busy cluster.
  • As a user I want to deploy workloads to Cluster API managed clusters based on cluster location and platform.

Story 2: Operations and Management

For multi-cluster administrators, the unified API provides a comprehensive view of all the clusters under their management. This includes verifying their memberships, understanding their status, capacity, and healthiness.

The API could also provide insights into the membership of each cluster, such as the number of nodes, their roles, and their current status. This can help administrators manage the clusters more effectively and ensure that they are always operating at optimal levels.

Story 3: Transparent to Consumers

The unified API ensures that consumers have the flexibility to choose different cluster management tools and switch among them as needed. Regardless of the tool they choose, they all use the same API to define clusters.

This means that consumers can switch from one cluster manager to another without having to worry about compatibility issues or learning a new API. This can significantly reduce the learning curve and make it easier for consumers to adopt new tools.

Moreover, the unified API can also provide a consistent user experience across different cluster managers. For example, if a consumer is used to a particular command or function in one tool, they can expect the same command or function to work in the same way in another tool. This can further enhance the usability and adoption of different cluster managers.

Notes/Constraints/Caveats

What's the relationship between the ClusterProfile API and Cluster Inventory?

The ClusterProfile API represents a single member cluster in a cluster inventory.

What's the relationship between a cluster inventory and clusterSet?

A cluster inventory may or may not represent a ClusterSet. A cluster inventory is considered a ClusterSet if all its member clusters adhere to the namespace sameness principle. Note that a cluster can only be in one ClusterSet, while there is no such restriction for a cluster inventory.

How should the API be consumed?

We recommend that all ClusterProfile objects within the same cluster inventory reside on a dedicated Kubernetes cluster (aka. the hub cluster). This approach allows consumers to have a single integration point to access all the information within a cluster inventory. Additionally, a multi-cluster aware controller can be run on the dedicated cluster to offer high-level functionalities over this inventory of clusters.

How should we organize ClusterProfile objects on a hub cluster?

While there are no strict requirements, we recommend making the ClusterProfile API a namespace-scoped object. This approach allows users to leverage Kubernetes' native namespace-based RBAC if they wish to restrict access to certain clusters within the inventory.

However, if a cluster inventory represents a ClusterSet, all its ClusterProfile objects MUST be part of the same ClusterSet, and a namespace must be used as the grouping mechanism. In addition, the namespace must have a label with the key "clusterset.multicluster.x-k8s.io" and the value set to the name of the ClusterSet.
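
As a rough sketch under these assumptions (the namespace label key from this section; the helper name is hypothetical), a consumer could discover the namespace(s) that group the ClusterProfile objects of a given ClusterSet like this:

package consumer

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

// clusterSetNamespaces returns the namespaces labeled as belonging to the given
// ClusterSet; the ClusterProfile objects of that ClusterSet live in these namespaces.
func clusterSetNamespaces(cfg *rest.Config, clusterSet string) ([]string, error) {
    clientset, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        return nil, err
    }
    nsList, err := clientset.CoreV1().Namespaces().List(context.TODO(), metav1.ListOptions{
        LabelSelector: "clusterset.multicluster.x-k8s.io=" + clusterSet,
    })
    if err != nil {
        return nil, err
    }
    var names []string
    for _, ns := range nsList.Items {
        names = append(names, ns.Name)
    }
    return names, nil
}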

Uniqueness of the ClusterProfile object

While there are no strict requirements, we recommend that there is only one ClusterProfile object representing any member cluster on a hub cluster.

However, because the namespace sameness property is transitive, a ClusterProfile object can only belong to one ClusterSet; if it is in a ClusterSet, it can only reside in the namespace of that ClusterSet.

Risks and Mitigations

Design Details

We try to summarize the most common properties needed to represent a cluster and support the above use cases with a minimal scope.

The target consumers of the API are users who manage clusters and tools that need a common cluster concept for multicluster scheduling, workload distribution, and cluster management. The API aims to provide enough information for consumers to answer the questions below:

  • Is the cluster under management?
  • How can I select a cluster?
  • Is the cluster healthy?
  • Does the cluster have certain capabilities or properties?
  • Does the cluster have sufficient resources?

Cluster Name

It is required that the cluster name is unique for each cluster, and it should also be unique across different providers (cluster managers). It is the cluster manager's responsibility to ensure name uniqueness.

It's the responsibility of the cluster manager platform administrator to ensure cluster name uniqueness. The examples below serve more as recommendations than hard requirements, providing guidance on best practices.

Example 1

The metadata.name of the cluster should never be set upon creation. Only metadata.generateName can be set.

Example 2

The metadata.name and metadata.generateName must have a prefix that is the same as spec.clusterManager.name. Different cluster managers must set different values for spec.clusterManager.name when the cluster is created.
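
A minimal sketch of object metadata that follows both recommendations (the manager name is illustrative; the actual naming policy is up to each cluster manager):

package manager

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Each cluster manager must use its own, distinct name.
const clusterManagerName = "some-cluster-manager"

// newClusterProfileMeta builds metadata per the recommendations above:
// metadata.name is never set directly (Example 1), and the generateName prefix
// matches spec.clusterManager.name (Example 2).
func newClusterProfileMeta() metav1.ObjectMeta {
    return metav1.ObjectMeta{
        GenerateName: clusterManagerName + "-",
        Labels: map[string]string{
            // Predefined label described in the Cluster Manager section below.
            "x-k8s.io/cluster-manager": clusterManagerName,
        },
    }
}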

Spec

Display name

A human-readable name of the cluster, set by the consumer of the cluster.

Cluster Manager

An immutable field set by a cluster manager when it creates the ClusterProfile resource. Each cluster manager instance should set a different value for this field.

In addition, a predefined label with the key "x-k8s.io/cluster-manager" needs to be added by the cluster manager upon creation. The value of the label MUST be the same as the name of the cluster manager. The purpose of this label is to make filtering clusters from different cluster managers easier, as in the sketch below.
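
For example (a sketch; the group/version/resource is assumed from the API example below and the helper name is hypothetical), a consumer can select only the clusters registered by one cluster manager:

package consumer

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/labels"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
)

// listByManager lists the ClusterProfiles created by a single cluster manager,
// selected through the predefined "x-k8s.io/cluster-manager" label.
func listByManager(client dynamic.Interface, manager string) (*unstructured.UnstructuredList, error) {
    gvr := schema.GroupVersionResource{
        Group:    "multicluster.x-k8s.io",
        Version:  "v1alpha1",
        Resource: "clusterprofiles",
    }
    opts := metav1.ListOptions{
        LabelSelector: labels.Set{"x-k8s.io/cluster-manager": manager}.String(),
    }
    return client.Resource(gvr).Namespace(metav1.NamespaceAll).List(context.TODO(), opts)
}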

Status

Version

The Kubernetes version of the cluster.

The Kubernetes version lets consumers understand the capabilities of the cluster, such as which APIs are supported.

With recent conversations about the kube-apiserver and enabled feature set versions, it is possible to incorporate other versions relating to the cluster, such as the minimum kubelet version, maximum kubelet version, and enabled feature set version.

Properties

Name/value pairs representing properties of the cluster. They could be a collection of ClusterProperty resources, but could also be information based on other implementations. The name of a cluster property can be a predefined name from the ClusterProperty resources, and different cluster managers are allowed to customize it.

Conditions

Records the cluster’s health conditions in a form that is easy to extend, conforming to the metav1.Condition format.

Predefined condition types:

  • Healthy conditions indicate the cluster is in a good state (status: True/False/Unknown). Healthiness can have different meanings in different scenarios. We will have multiple condition types to define healthiness:
    • ControlPlaneHealthy defines whether the control plane of the cluster is in a healthy state, based on:
      • apiserver/healthz
      • controller-manager/healthz
      • scheduler/healthz
      • etcd/healthz
    • <<[UNRESOLVED]>> AllNodesHealthy defines whether the nodes in the cluster are in a healthy state. If one node is not healthy, the status of AllNodesHealthy is set to false with a message indicating details. (TODO: the tolerance should be configurable.) It would be useful to collect the healthiness of other subsystems in the cluster, e.g. network, DNS, storage, or ingress. However, it is not easy to collect that information in a common way across different implementations of network or storage providers. We decided not to include other subsystem healthiness conditions in the initial phase.
  • Joined: indicates the cluster is under management by the cluster manager. The status of the cluster SHOULD be updated by the cluster manager under this condition.
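
The Go types implied by the fields above might look roughly like the sketch below. This is illustrative only, derived from the YAML example that follows; the actual CRD definition and generated types are maintained out of tree and may differ.

package v1alpha1

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ClusterProfile represents a single member cluster in a cluster inventory.
type ClusterProfile struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   ClusterProfileSpec   `json:"spec,omitempty"`
    Status ClusterProfileStatus `json:"status,omitempty"`
}

type ClusterProfileSpec struct {
    // DisplayName is a human-readable name set by the consumer of the cluster.
    DisplayName string `json:"displayName,omitempty"`
    // ClusterManager is immutable and identifies the manager that created this object.
    ClusterManager ClusterManager `json:"clusterManager"`
}

type ClusterManager struct {
    Name string `json:"name"`
}

type ClusterProfileStatus struct {
    // Version carries the Kubernetes version of the member cluster.
    Version ClusterVersion `json:"version,omitempty"`
    // Properties holds name/value pairs describing the member cluster.
    Properties []Property `json:"properties,omitempty"`
    // Conditions conform to the metav1.Condition conventions.
    Conditions []metav1.Condition `json:"conditions,omitempty"`
}

type ClusterVersion struct {
    Kubernetes string `json:"kubernetes,omitempty"`
}

type Property struct {
    Name  string `json:"name"`
    Value string `json:"value"`
}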

API Example

apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ClusterProfile
metadata:
  name: generated-cluster-name
  labels:
    x-k8s.io/cluster-manager: some-cluster-manager
spec:
  displayName: some-cluster
  clusterManager:
    name: some-cluster-manager
status:
  version:
    kubernetes: 1.28.0
  properties:
    - name: clusterset.k8s.io
      value: some-clusterset
    - name: location
      value: apac
  conditions:
    - type: ControlPlaneHealthy
      status: "True"
      lastTransitionTime: "2023-05-08T07:56:55Z"
      message: ""
    - type: Joined
      status: "True"
      lastTransitionTime: "2023-05-08T07:58:55Z"
      message: ""
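
As a hedged sketch of how a consumer might interpret the predefined conditions on such an object (only ControlPlaneHealthy and Joined are defined by this KEP; the helper name is hypothetical and assumes the conditions have already been decoded into metav1.Condition values):

package consumer

import (
    apimeta "k8s.io/apimachinery/pkg/api/meta"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// schedulable reports whether a member cluster looks usable for placement:
// it is under management (Joined) and its control plane is healthy.
func schedulable(conditions []metav1.Condition) bool {
    return apimeta.IsStatusConditionTrue(conditions, "Joined") &&
        apimeta.IsStatusConditionTrue(conditions, "ControlPlaneHealthy")
}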

Scalability implication

The API should provide summarized metadata of the cluster and relatively "static" cluster status. Dynamic data, e.g. cluster resource usage, should not be included in this API given it will bring heavy traffic to the control plane. A metrics collector system would be better suited in this scenario.

The QPS for each single cluster object is supposed to be less than 1/30 (one update per 30s on average). This should be achievable since the cluster properties in the status field are not supposed to change too frequently. The burst for each single cluster object is supposed to be 10, to handle the initial join and sudden storms.
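
As one possible way to honor this budget (a sketch; the helper name is hypothetical), a cluster manager could guard status updates for each ClusterProfile object with client-go's token bucket rate limiter configured with the numbers above:

package manager

import (
    "k8s.io/client-go/util/flowcontrol"
)

// newStatusUpdateLimiter returns a per-ClusterProfile rate limiter allowing on
// average one status update every 30 seconds, with a burst of 10 to absorb the
// initial join or a sudden storm. Before writing a status update, the cluster
// manager can call TryAccept() and skip or coalesce the update when it returns false.
func newStatusUpdateLimiter() flowcontrol.RateLimiter {
    return flowcontrol.NewTokenBucketRateLimiter(1.0/30.0, 10)
}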

Test Plan

  • I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

This KEP proposes an out-of-tree CRD that is not expected to integrate with any of the Kubernetes CI infrastructure. In addition, it explicitly provides only the CRD definition and generated clients for use by third-party implementers, and does not provide a controller or any other binary with business logic to test. Therefore, we only expect unit tests to validate the generated client and integration tests for API validation.

However, similar to other out-of-tree CRDs that serve third party implementers, such as Gateway API and MCS API, there is rationale for the project to provide conformance tests for implementers to use to confirm they adhere to the restrictions set forth in this KEP that are not otherwise enforced by the CRD definition.

Prerequisite testing updates
Unit tests
  • <package>: <date> - <test coverage>
Integration tests
  • :
e2e tests
  • :

Graduation Criteria

Alpha

  • A CRD definition and generated client.
  • A dummy controller and unit test to validate the CRD and client.

Beta

  • Gather feedback from users during the Alpha stage to identify any issues, limitations, or areas for improvement. Address this feedback by making the necessary changes to the API and iterating on its design and functionality.
  • The API should support the addition of scheduling features, such as:
    • Load balancing: Distribute workloads evenly across clusters based on their current load and capacity.
    • Affinity and anti-affinity rules: Allow users to define rules for placing workloads on specific clusters or ensuring that certain workloads are not placed on the same cluster.
    • Priority-based scheduling: Enable users to assign priorities to workloads, ensuring that higher-priority workloads are scheduled before lower-priority ones.
    • Resource-based scheduling: Schedule workloads based on the availability of specific resources, such as CPU, memory, or storage, in each cluster.
  • The API should expose access information including but not limited to:
    • APIServer endpoint url of the member cluster.
    • Credential with limited access to the member cluster.
  • At least two providers and one consumer using ClusterProfile API.

GA

  • N examples of real-world usage
  • N installs
  • More rigorous forms of testing—e.g., downgrade tests and scalability tests
  • Allowing time for feedback
  • Stability: The API should demonstrate stability in terms of its reliability.
  • Functionality: The API should provide the necessary functionality for multicluster scheduling, including the ability to distribute workloads across clusters, manage cluster memberships, and monitor cluster health and capacity. This should be validated through a series of functional tests and real-world use cases.
  • Integration: Ensure that the API can be easily integrated with popular workload distribution tools, such as GitOps and Work API. This may involve developing plugins or extensions for these tools or providing clear guidelines on how to integrate them with the unified API.
  • Performance and Scalability: Conduct performance and scalability tests to ensure that the API can handle a large number of clusters and workloads without degrading its performance. This may involve stress testing the API with a high volume of requests or simulating large-scale deployments.

Note: Generally we also wait at least two releases between beta and GA/stable, because there's no opportunity for user feedback, or even bug reports, in back-to-back releases.

Upgrade / Downgrade Strategy

Version Skew Strategy

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name:
    • Components depending on the feature gate:
  • Other
    • Describe the mechanism:
    • Will enabling / disabling the feature require downtime of the control plane?
    • Will enabling / disabling the feature require downtime or reprovisioning of a node?
Does enabling the feature change any default behavior?
  • No default Kubernetes behavior is currently planned to be based on this feature; it is designed to be used by the separately installed, out-of-tree, multicluster management providers and consumers.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
  • Yes, as this feature only describes a CRD, it can most directly be disabled by uninstalling the CRD.
What happens if we reenable the feature if it was previously rolled back?
Are there any tests for feature enablement/disablement?
  • As a dependency only for an out-of-tree component, there will not be e2e tests for feature enablement/disablement of this CRD in core Kubernetes. The e2e test can be provided by multicluster management providers who support this API.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?
What specific metrics should inform a rollback?
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?
How can someone using this feature know that it is working for their instance?
  • Events
    • Event Reason:
  • API .status
    • Condition name:
    • Other field:
  • Other (treat as last resort)
    • Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Metrics
    • Metric name:
    • [Optional] Aggregation method:
    • Components exposing the metric:
  • Other (treat as last resort)
    • Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?
Will enabling / using this feature result in introducing new API types?
Will enabling / using this feature result in any new calls to the cloud provider?
Will enabling / using this feature result in increasing size or count of the existing API objects?
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?
What are other known failure modes?
What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

Drawbacks

Alternatives

Extending Cluster API Cluster resource

We also considered the possibility of extending the existing Cluster API's Cluster resource to accommodate our needs for describing clusters within a multi-cluster environment. However, this approach was ruled out due to the Cluster API's primary focus on cluster lifecycle management. Its tight coupling with cluster provisioning processes made it less suitable for scenarios where clusters are either provisioned through different methods or already exist. Furthermore, another distinction is the nature of the information each API conveys: the Cluster API's Cluster resource outlines the desired state of the cluster. In contrast, the new API is intended to reflect the actual state of the cluster, more similar to the Cluster.status in the Cluster API, but with a broader scope and intended for use in a multi-cluster context. This distinction also extends to the ownership of the resources; the Cluster API's Cluster is primarily owned by platform administrators focused on provisioning clusters, whereas the new API is designed to be owned by the cluster manager that created the cluster it represents.

ClusterProfile CRD scope

We had extensive discussions in SIG-Multicluster meetings about the appropriate scope for ClusterProfile resources, and ultimately decided that namespace scope would be more flexible than cluster scope while still retaining an adequate UX for simpler usage patterns. As a historical note, a prior attempt at organizing multiple clusters, the ClusterRegistry proposal, had proposed cluster-scoped resources but was met with pushback by potential adopters in part due to a desire to host multiple distinct registry lists on a single control plane, which would be far more straightforward with namespaced resources.

Global hub cluster for multiple clustersets

illustration of global hub for multiple clustersets topology

In this model, a single global hub cluster is used to manage multiple clustersets (a "Prod" clusterset and "Dev" clusterset in this illustration). For this use case, some means of segmenting the ClusterProfile resources into distinct groups for each clusterset is needed, and ideally should facilitate selecting all ClusterProfiles of a given clusterset. Because of this selection-targeting goal, setting clusterset membership within the spec of a ClusterProfile would not be sufficient. While setting a label such as the proposed clusterset.multicluster.x-k8s.io on the ClusterProfile resource (instead of a namespace) could be acceptable, managing multiple cluster-scoped ClusterProfile resources for multiple unrelated clustersets on a single global hub could quickly get cluttered. In addition to grouping clarity, namespace scoping could allow RBAC delegation for separate teams to manage resources for their own clustersets in isolation while still using a shared hub. The group of all clusters registered on the hub (potentially including clusters belonging to different clustersets or clusters not belonging to any clusterset) may represent a single "inventory" or multiple inventories, but such a definition is beyond the scope of this document and is permissible to be an undefined implementation detail.

Global hub cluster per clusterset

illustration of global hub per clusterset topology

In this model, each "inventory" has a 1:1 mapping with a clusterset containing clusters in multiple regions. A cluster-scoped ClusterProfile CRD would be sufficient for this architecture, but it requires a proliferation of hub clusters, which may not be optimal. This model is still implementable with namespace-scoped ClusterProfile CRDs by writing them all to a single namespace, either the default namespace or a specific namespace configured in the cluster manager. The risk of placing resources in the wrong namespace would be somewhat minimal if following the suggested pattern of having ClusterProfile resources be written by a "manager" rather than authored by humans.

Regional hub cluster for multiple clustersets

illustration of regional hub clusters for multiple clustersets topology

In this model, "hub" clusters are limited to a regional scope (potentially for architectural limitations or performance optimizations) and each hub is used to manage clusters only from the local region, but which may belong to separate clustersets. If, as in the pictured example, clustersets still span multiple regions, some out-of-band synchronization mechanism between the regional hubs would likely be needed. This model has similar segmentation needs to the global hub model, just at a smaller scale.

Regional hub clusters per clusterset

illustration of regional hub clusters per clusterset topology

This is creeping pretty far towards excessive cluster proliferation (and cross-region coordination overhead) purely for management needs (as opposed to actually running workloads), and would be more likely to be a reference or testing implementation than an architecture suitable for production scale.

Self-assembling clustersets

illustration of self-assembling clusterset topology

This is the model most suited to a cluster-scoped ClusterProfile resource. In contrast to the prior models discussed, in this approach the ClusterProfile CRD would be written directly to each "member" cluster. ClusterSet membership would either be established through peer-to-peer relationships, or managed by an external control plane. For ClusterSet security and integrity, a two-way handshake of some sort would be needed between the local cluster and each peer or the external control plane to ensure it is properly authorized to serve endpoints for exported services or import services from other clusters. While these approaches could be implemented with a namespace-scoped ClusterProfile CRD in the default or a designated namespace, misuse is most likely in this model, because the resource would be more likely to be authored by a human if using the peer-to-peer model. Due to the complexity and fragility concerns of managing clusterset membership in a peer-to-peer topology, an external control plane would likely be preferable. Assuming the external control plane does not support Kubernetes APIs (if it did, any of the "hub" models could be applied instead), it could still be possible to implement this model with a namespace-scoped ClusterProfile resource, but it is not recommended.

Workload placement across multiple clusters without cross-cluster service networking

In this model, a consumer of the Cluster Inventory API is looking to optimize workload placement to take advantage of excess capacity on existing managed clusters. These workloads may have specific hardware resource needs such as GPUs, but are typically "batch" jobs that do not require multi-cluster service networking to communicate with known services in a specific clusterset. The isolated nature of these jobs could allow them to be scheduled on many known clusters regardless of clusterset membership. A centralized hub which could register clusters in disparate clustersets or no clusterset and return a list of all known clusters from a single API call would be the most efficient for this consumer to query. Namespaced ClusterProfile CRDs on a global hub would be the best fit for this use case.

Workload placement into a specific clusterset

Within a single clusterset, a global workload placement controller may seek to balance capacity across multiple regions in response to demand, cost efficiency, or other factors. Querying a list of all clusters within a single clusterset should be possible to serve this use case, which is amenable to either cluster-scoped or namespaced-scoped ClusterProfile CRDs.

Infrastructure Needed (Optional)