definition of the HostClaim resource
The HostClaim resource is introduced to address multi-tenancy in
Metal3 and the definition of hybrid clusters with cluster-api.

Introduces four scenarios representing the three main use cases and
how pivot is made.

Presentation of the design will be addressed in another merge request.

Co-authored-by: Pierre Crégut <pierre.cregut@orange.com>
Co-authored-by: Laurent Roussarie <laurent.roussarie@orange.com>
Signed-off-by: Pierre Crégut <pierre.cregut@orange.com>
pierrecregut and lroussarie committed Apr 2, 2024
1 parent fa98964 commit 40604f3
New file: design/hostclaim-multitenancy-and-hybrid-clusters.md (297 additions)

<!--
This work is licensed under a Creative Commons Attribution 3.0
Unported License.
http://creativecommons.org/licenses/by/3.0/legalcode
-->

# HostClaim: multi-tenancy and hybrid clusters


## Status

provisional

## Summary

We introduce a new Custom Resource (named HostClaim) which will facilitate
the creation of multiple clusters for different tenants. It also provides
a framework for building clusters with different kinds of compute resources:
baremetal servers, but also virtual machines hosted in private or public clouds.

A HostClaim decouples the client need from the actual implementation of the
compute resource: it establishes a security boundary and provides a way to
migrate nodes between different kinds of compute resources.

A HostClaim expresses that one wants to start a given
OS image with an initial configuration (typically cloud-init or ignition
configuration files) on a compute resource that meets a set of requirements
(host selectors). These requirements can be interpreted either as labels
for a recyclable resource (such as a bare-metal server) or as characteristics
for a disposable resource (such as a virtual machine created for the workload).
The status and metadata of the HostClaim provide the necessary information
for the end user to define and manage their workload on the compute resource,
but they do not grant full control over the resource (typically, BMC
credentials of servers are not exposed to the tenant).
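
As an illustration, a minimal HostClaim could look like the following sketch.
The field names follow the examples in the user stories below; the exact
schema will be given in the design details.

```yaml
apiVersion: metal3.io/v1alpha1
kind: HostClaim
metadata:
  name: example
spec:
  kind: baremetal
  online: true
  image:
    url: https://example.com/image.qcow2
    checksum: https://example.com/image.qcow2.md5
    format: qcow2
  userData:
    name: example-cloud-init
  hostSelector:
    matchLabels:
      infra-kind: medium
```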

## Motivation

So far, the primary use case of cluster-api-provider-metal3 is the creation of a
single target cluster from a temporary management cluster. The pivot process
transfers the resources describing the target cluster from the management
cluster to the target cluster. Once the pivot process is complete, the target
cluster takes over all the servers. It can scale based on its workload but it
cannot share its servers with other clusters.

There is another model where a single management cluster is used to create and
manage several clusters across a set of bare-metal servers. This is the focus
of the [Sylva Project](https://sylvaproject.org/) of the Linux foundation.
Another example is [Das Schiff](https://github.com/telekom/das-schiff).

One of the issues encountered today is that the compute resources
(BareMetalHost) and the cluster definition (Cluster, MachineDeployment,
Machines, Metal3Machines, etc.) must be in the same namespace. Since the goal
is to share the compute resources, this means that a single namespace is used
for all resources. Consequently, unless very complex access control
rules are defined, cluster administrators have visibility over all clusters
and full control over the servers as the credentials are stored in the same
namespace.

The solution so far is to completely proxy the access to the Kubernetes
resources that define the clusters.

Another unrelated problem is that Cluster-API has been designed
to define clusters using homogeneous compute resources: it is challenging to
define a cluster with both bare-metal servers and virtual machines in a private
or public cloud.
This [blog post](https://metal3.io/blog/2022/07/08/One_cluster_multiple_providers.html)
proposes several approaches but none is entirely satisfactory.

On the other hand, workloads can be easily defined in terms of OS images and
initial configurations, and standards such as qcow or cloud-init have emerged
that are used by various infrastructure technologies.
Due to the unique challenges of managing bare-metal, the Metal3 project has
developed a set of abstractions and tools that could be used in different
settings. The main mechanism is the selection process that occurs between
the Metal3Machine and the BareMetalHost, which assigns a disposable workload
(a Kubernetes node) to a recyclable compute resource (a
server).

This proposal introduces a new resource called HostClaim that solves
both problems by decoupling the definition of the workload performed by
the Metal3 machine controller from the actual compute resource.
This resource acts as both a security boundary and a way to hide the
implementation details of the compute resource.

### Goals

* Split responsibilities between infrastructure teams, who manage servers, and
cluster administrators, who create/update/scale baremetal clusters deployed
on those servers, using traditional Kubernetes RBAC to ensure isolation
(see the sketch after this list).
* Provide a framework where cluster administrators can consume compute
resources that are not baremetal servers, as long as they offer similar APIs,
using the cluster-api-provider-metal3 to manage the life-cycle of those
resources.
* Define a resource where a user can request a compute resource to execute
an arbitrary workload described by an OS image and an initial configuration.
The user does not need to know exactly which resource is used and may not
have full control over this resource (typically no BMC access).
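
As an illustration of the isolation goal, a tenant could be granted access to
HostClaims in its own namespace without any access to BareMetalHosts. The
following is only a sketch; it assumes that HostClaim is served by the
``metal3.io`` API group with resource name ``hostclaims``, as in the examples
below.

```yaml
# Sketch: a namespaced Role for a tenant. It grants access to HostClaims but
# not to BareMetalHosts, which live in namespaces owned by the infrastructure
# team.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: hostclaim-user
  namespace: tenant-a
rules:
- apiGroups: ["metal3.io"]
  resources: ["hostclaims", "hostclaims/status"]
  verbs: ["get", "list", "watch", "create", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: hostclaim-user
  namespace: tenant-a
subjects:
- kind: Group
  name: tenant-a-admins
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: hostclaim-user
  apiGroup: rbac.authorization.k8s.io
```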


### Non-Goals

* How to implement HostClaim for specific compute resources that are not
BareMetalHost.
* Discovery of the capabilities exposed by the cluster.
Which kinds of compute resources are available and the semantics of the
selectors are not handled.
* Compute resource quotas. The HostClaim resource should make it possible to
develop a framework to limit the number/size of compute resources allocated
to a tenant, similar to how quotas work for pods. However, the specification
of such a framework will be addressed in another design document.
* Pivoting client cluster resources (managed clusters that are not the
initial cluster).

## Proposal


### User Stories

#### As a user I would like to execute a workload on an arbitrary server

The OS image is available in qcow format on a remote server at ``url_image``.
It supports cloud-init and a script can launch the workload at boot time
(e.g., a systemd service).
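
For example, the cloud-init user data could install and enable a systemd unit
that starts the workload at boot. This is only a sketch: the unit name and the
command are placeholders.

```yaml
#cloud-config
write_files:
- path: /etc/systemd/system/my-workload.service
  content: |
    [Unit]
    Description=Example workload (placeholder)
    After=network-online.target

    [Service]
    ExecStart=/usr/local/bin/my-workload
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target
runcmd:
- [ systemctl, daemon-reload ]
- [ systemctl, enable, --now, my-workload.service ]
```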

The cluster offers bare-metal as a service using Metal3 baremetal-operator.
However, as a regular user, I am not allowed to directly access the definitions
of the servers. All servers are labeled with an ``infra-kind`` label whose
value depends on the characteristics of the computer.

* I create a resource with the following content:
```yaml
apiVersion: metal3.io/v1alpha1
kind: HostClaim
metadata:
  name: my-host
spec:
  online: false
  kind: baremetal

  hostSelector:
    matchLabels:
      infra-kind: medium
```
* After a while, the system associates the claim with a real server, and
the resource's status is populated with the following information:
```yaml
status:
  addresses:
  - address: 192.168.133.33
    type: InternalIP
  - address: fe80::6be8:1f93:7f65:59cf%ens3
    type: InternalIP
  - address: localhost.localdomain
    type: Hostname
  - address: localhost.localdomain
    type: InternalDNS
  bootMACAddress: "52:54:00:01:00:05"
  conditions:
  - lastTransitionTime: "2024-03-29T14:33:19Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-03-29T14:33:19Z"
    status: "True"
    type: AssociateBMH
  lastUpdated: "2024-03-29T14:33:19Z"
  nics:
  - MAC: "52:54:00:01:00:05"
    ip: 192.168.133.33
    name: ens3
```
* I also examine the annotations and labels of the HostClaim resource. They
have been enriched with information from the BareMetalHost resource.
* I create three secrets, ``my-user-data``, ``my-meta-data``, and
``my-network-data``, in the same namespace. I use the information from the
status and metadata to customize the scripts they contain, for example:
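
A sketch of the user-data secret, wrapping a cloud-init file such as the one
shown earlier (the ``userData`` key name is an assumption; the controller may
expect a different key):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: my-user-data
type: Opaque
stringData:
  # Assumed key name; the content is regular cloud-init user data.
  userData: |
    #cloud-config
    runcmd:
    - [ systemctl, enable, --now, my-workload.service ]
```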
* I modify the HostClaim to point to those secrets and start the server:
```yaml
apiVersion: metal3.io/v1alpha1
kind: HostClaim
metadata:
  name: my-host
spec:
  online: true
  image:
    checksum: https://url_image.qcow2.md5
    url: https://url_image.qcow2
    format: qcow2
  userData:
    name: my-user-data
  networkData:
    name: my-network-data
  kind: baremetal
  hostSelector:
    matchLabels:
      infra-kind: medium
```
* The workload is launched. When the machine is fully provisioned, the boolean
field ``ready`` in the status becomes true. I can stop the server by setting
``online`` back to false. I can also perform a reboot by setting specific
annotations in the reserved ``host.metal3.io`` domain (see the sketch after
this list).
* When I destroy the host, the association is broken and another user can take
over the server.
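
The annotation names in the ``host.metal3.io`` domain are not fixed by this
proposal; as a purely hypothetical illustration, a reboot request could look
like:

```yaml
metadata:
  annotations:
    # Hypothetical annotation name; the actual names and semantics will be
    # defined in the design details.
    host.metal3.io/reboot: ""
```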

#### As an infrastructure administrator I would like to host several isolated clusters

All the servers in the data-center are registered as BareMetalHost in one or
several namespaces under the control of the infrastructure manager. Namespaces
are created for each tenant of the infrastructure. The tenants create
standard cluster definitions in those namespaces. The only difference with
standard baremetal cluster definitions is the presence of a ``kind`` field in
the Metal3Machine templates. The value of this field is set to ``baremetal``.
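
A sketch of such a template (names and namespaces are placeholders; the layout
follows the hybrid example in the next scenario):

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3MachineTemplate
metadata:
  name: tenant-a-workers
  namespace: tenant-a
spec:
  dataTemplate:
    name: tenant-a-workers-metadata
  kind: baremetal
  hostSelector:
    matchLabels:
      infra-kind: medium
  image:
    checksum: https://url_image.qcow2.md5
    format: qcow2
    url: https://url_image.qcow2
```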

When the cluster is started, a HostClaim is created for each Metal3Machine
associated with the cluster. The ``hostSelector`` and ``kind`` fields are
inherited from the Metal3Machine. They are used to select the BareMetalHost
associated with the cluster. The associated BareMetalHost is not in the same
namespace as the HostClaim. The exact definition of the BareMetalHost remains
hidden from the cluster user, but parts of its status and metadata are copied
back to the HostClaim namespace. With this information,
the data template controller has enough details to compute the different
secrets (userData, metaData and networkData) associated with the Metal3Machine.
Those secrets are linked to the HostClaim and, ultimately, to the
BareMetalHost.

When the cluster is modified, new Machine and Metal3Machine resources replace
the previous ones. The HostClaims follow the life-cycle of the Metal3Machines
and are destroyed unless they are tagged for node reuse. The BareMetalHosts are
recycled and are bound to new HostClaims, potentially belonging to other
clusters.

#### As a cluster administrator I would like to build a cluster with different kinds of nodes

This scenario assumes that:

* the cloud technologies CT<sub>i</sub> use qcow images and cloud-init to
define workloads.
* clouds C<sub>i</sub> implementing CT<sub>i</sub> are accessible through
credentials and endpoints described in a resource Cr<sub>i</sub>.
* a HostClaim controller exists for each CT<sub>i</sub>. Compute resources can
be parameterized through arguments in the HostClaim (arg<sub>ij</sub>).

One can build a cluster where each machine deployment MD<sub>i</sub>
contains Metal3Machine templates referring to kind CT<sub>i</sub>.
The arguments identify the credentials to use (Cr<sub>i</sub>):
```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3MachineTemplate
metadata:
  name: md-i
spec:
  dataTemplate:
    name: dt-i
  kind: CT_i
  args:
    arg_i1: v1
    ...
    arg_ik: vk
  hostSelector:
    matchLabels:
      ...
  image:
    checksum: https://image_url.qcow2.md5
    format: qcow2
    url: https://image_url.qcow2
```

The Metal3Machine controller will create HostClaims with different kinds,
handled by different controllers that create the corresponding compute
resources. Connectivity must be established between the different subnets
where each controller creates its compute resources.
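
For a machine in MD<sub>i</sub>, the resulting HostClaim could look like the
following sketch (field names follow the first user story; how the ``args``
are interpreted is entirely up to the CT<sub>i</sub> controller):

```yaml
apiVersion: metal3.io/v1alpha1
kind: HostClaim
metadata:
  name: md-i-worker    # generated from the Metal3Machine name
spec:
  kind: CT_i           # routes the claim to the CT_i controller
  online: true
  args:
    arg_i1: v1
  image:
    checksum: https://image_url.qcow2.md5
    format: qcow2
    url: https://image_url.qcow2
  userData:
    name: md-i-worker-user-data   # secret rendered for the Metal3Machine
```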

The argument extension is not strictly necessary but is cleaner than using
matchLabels in specific domains to convey information to the controllers.
Controllers for disposable resources such as virtual machines typically do not
use hostSelectors. Controllers for a "bare-metal as a service" offering
may use selectors.

#### As a cluster administrator I would like to install a new baremetal cluster from a transient cluster

The bootstrap process can be performed as usual from an ephemeral cluster
(e.g., a KinD cluster). The constraint that all resources must be in the same
namespace (Cluster and BareMetalHost resources) must be respected. The
BareMetalHosts should be marked as movable.

The only difference with the behavior without HostClaim is the presence of an
intermediate HostClaim resource, but the chain of resources is kept during the
transfer and the pause annotation is used to stop Ironic.

Because this operation is only performed by the administrator of a cluster
manager, the fact that the cluster definition and the BareMetalHosts are in
the same namespace should not be an issue.

The tenant clusters cannot be pivoted, which can be expected from a security
point of view as pivoting would give the bare-metal server credentials to the
tenants. Partial pivot can be achieved with the help of HostClaims replicated
on other clusters, but the specification of the associated controller is
beyond the scope of this document.
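
As an illustration, marking a BareMetalHost as movable and pausing its
reconciliation could rely on the existing clusterctl and baremetal-operator
conventions (assumed here; the exact mechanism belongs to the design details):

```yaml
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: worker-0
  labels:
    # Assumed: clusterctl moves resources carrying this label.
    clusterctl.cluster.x-k8s.io/move: ""
  annotations:
    # Assumed: pauses reconciliation of the host during the move.
    baremetalhost.metal3.io/paused: ""
```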

## Design Details

TBD.
