
[k8s] Enable multiple kubernetes contexts for failover #3968

Draft
wants to merge 7 commits into master

Conversation

@Michaelvll Michaelvll (Collaborator) commented Sep 22, 2024

This allows users to specify the following in ~/.sky/config.yaml to enable SkyPilot to fail over across different Kubernetes contexts.

kubernetes:
  allowed_contexts:
    - kind-skypilot
    - gke_skypilot-xxx_us-central1-c_test-zhwu
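
Side note (not part of this PR): the context names to list under allowed_contexts are the same names that kubectl reports, e.g. via:

kubectl config get-contexts -o name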

TODO:

  • Check if we should improve the UX output for region vs context

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch -c test --cloud kubernetes --cpus 4 echo hi with two k8s clusters, one whose nodes have fewer than 4 CPUs and one whose nodes have more than 4 CPUs; SkyPilot correctly fails over from the first k8s cluster to the second one
    • Removed the larger k8s cluster's context name from allowed_contexts, then ran sky exec / sky launch again on the existing SkyPilot cluster
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@romilbhardwaj romilbhardwaj left a comment

Thanks @Michaelvll! Took a quick look

Comment on lines 119 to 120
if allowed_contexts is None:
return cls._regions
@romilbhardwaj (Collaborator):

If we are changing the semantics in this PR to have region for Kubernetes indicate the context, should we change this to return the current active context?

I.e., the region for Kubernetes now is always the name of a context. If config.yaml does not contain allowed_contexts, we use the current active context name instead of the hardcoded _SINGLETON_REGION. wdyt?
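
As a rough sketch (not code from this PR), the fallback could look like the following, using the kubernetes Python client; the helper name and the clouds.Region usage are assumptions:

from typing import Optional

from kubernetes import config as kube_config


def _active_context_name() -> Optional[str]:
    """Return the active kubeconfig context name, or None if no kubeconfig is available."""
    try:
        _, active_context = kube_config.list_kube_config_contexts()
        return active_context['name']
    except Exception:  # e.g. kubeconfig missing or malformed
        return None


# Inside Kubernetes.regions() (sketch):
#   allowed_contexts = skypilot_config.get_nested(
#       ('kubernetes', 'allowed_contexts'), None)
#   if allowed_contexts is None:
#       active = _active_context_name()
#       allowed_contexts = [active] if active is not None else []
#   return [clouds.Region(context_name) for context_name in allowed_contexts]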

@romilbhardwaj (Collaborator):

Also noting that this may require some backward compatibility testing, since the cluster region would change.

@Michaelvll (Collaborator, PR author):

I am thinking of keeping the same semantics as the original when allowed_contexts is not specified. That way, users with a single k8s cluster will not need to understand another concept. Wdyt?

@romilbhardwaj (Collaborator):

I think there are three main benefits of changing region to use the context name:

  1. Keeps our YAMLs/CLI consistent irrespective of whether allowed_contexts is used. For example, say a user wrote a YAML like this when they were using allowed_contexts: [dev, staging, prod]:
     resources:
       cloud: kubernetes
       region: dev
    Now say they removed allowed_contexts or shared this with a colleague without allowed_contexts set. They would need to update the YAML to region: kubernetes or remove the region flag to make this work. Also, they would lose the "run this YAML only on dev cluster" directive from the YAML.
  2. Helps [k8s] Show which Kubernetes context/cluster is used in sky status #3461. This is a common problem if the user switches contexts (even though they may use one at a time).
  3. Showing the context name under region during sky launch helps reaffirm a) the cluster your job will run on and b) SkyPilot can switch between contexts now.

Even for single k8s users, using a concrete region name instead of kubernetes might be easier to understand. wdyt?
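
For illustration only (assuming the proposed semantics where the context name is the region), a user could then pin a launch to one context with the existing --region flag; the cluster and task names here are placeholders:

sky launch --cloud kubernetes --region dev -c dev-cluster task.yaml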

@Michaelvll (Collaborator, PR author):

That is a good point! I have changed the region to always show the context.

docs/source/reference/config.rst (comment resolved)
allowed_contexts = skypilot_config.get_nested(
('kubernetes', 'allowed_contexts'), None)
if allowed_contexts is None:
return cls._regions
@romilbhardwaj (Collaborator):

[commentary, no action required] I am liking the idea of using regions (instead of clouds) to do multi-kubernetes. In the future, if we want to enable multi-k8s out of the box, we can simply return all contexts here :)
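
A rough sketch of that future direction (not part of this PR), again using the kubernetes Python client:

from typing import List

from kubernetes import config as kube_config


def _all_context_names() -> List[str]:
    """Return every context name defined in the local kubeconfig."""
    contexts, _ = kube_config.list_kube_config_contexts()
    return [c['name'] for c in contexts]

# Multi-k8s out of the box could then amount to treating each of these names
# as a region, instead of only the ones listed in allowed_contexts.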

@Michaelvll Michaelvll (Collaborator, PR author), Sep 23, 2024:

Conceptually, I find it clearer to have the following mapping:
k8s contexts -> local cloud config profiles.
Both of them contain:

  1. the identity to use for accessing the resource pool (k8s: user + namespace; cloud config: account)
  2. the resource pool to look at (k8s: cluster; cloud config: project to use)

I think the current way is a simple workaround for now, but we may need a better design in the future. The main confusion with using region may come from the fact that multiple contexts can map to the same k8s cluster with different namespaces or users.
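
For reference, a kubeconfig context entry bundles exactly these two pieces (identity and resource pool); an illustrative entry, reusing the context name from the example above:

contexts:
- name: gke_skypilot-xxx_us-central1-c_test-zhwu
  context:
    cluster: gke_skypilot-xxx_us-central1-c_test-zhwu  # the resource pool
    user: gke_skypilot-xxx_us-central1-c_test-zhwu     # the identity
    namespace: default                                 # optional namespace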

@romilbhardwaj (Collaborator):

Agreed, we probably need better solutions. I just realized many properties in the config may need to be updated in the near future to work well for multi-cluster (e.g., some contexts may need ports: ingress, while others may need ports: loadbalancer; same for other fields).
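
A purely hypothetical sketch of what per-context overrides could look like (this structure does not exist in the config today; the context_configs field name is made up for illustration):

kubernetes:
  allowed_contexts:
    - context1
    - context2
  # Hypothetical per-context section, not supported by this PR:
  context_configs:
    context1:
      ports: ingress
    context2:
      ports: loadbalancer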

@Michaelvll (Collaborator, PR author):

Yes, while updating the code to always show the context as the region, I also realized that there are more places to update, especially the failover code in Kubernetes._get_feasible_launchable_resources. If we have two clusters with different resource sets, our failover will likely disregard all the Kubernetes clusters if the cluster without the resources is the current active context.

Marking this PR as a draft for now to fix this issue.
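
A rough sketch (not the PR's implementation) of per-context feasibility checking, which would avoid discarding Kubernetes entirely just because the active context cannot fit the request; the CPU-only check and helper names are simplifications/assumptions:

from typing import List

from kubernetes import client as kube_client
from kubernetes import config as kube_config


def _parse_cpu(quantity: str) -> float:
    """Parse a Kubernetes CPU quantity, e.g. '4' or '3920m'."""
    return float(quantity[:-1]) / 1000 if quantity.endswith('m') else float(quantity)


def contexts_that_fit(allowed_contexts: List[str], cpus_needed: float) -> List[str]:
    """Return the contexts (in failover order) that have at least one node whose
    allocatable CPU can fit the request. Simplified: ignores memory, GPUs, and
    pods already running on the nodes."""
    feasible = []
    for context in allowed_contexts:
        api = kube_client.CoreV1Api(
            api_client=kube_config.new_client_from_config(context=context))
        nodes = api.list_node().items
        if any(_parse_cpu(node.status.allocatable['cpu']) >= cpus_needed
               for node in nodes):
            feasible.append(context)
    return feasible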

@romilbhardwaj romilbhardwaj (Collaborator) left a comment

Thanks @Michaelvll! Tested it and works nicely. Left some comments.

sky/clouds/kubernetes.py (outdated; comment resolved)
Comment on lines +489 to +492
# If not specified, only the current active context is used for launching.
#
# When specified, SkyPilot will fail over through the contexts in the same
# order as they are specified here.
@romilbhardwaj (Collaborator):

Some rewording and example:

Suggested change
# If not specified, only the current active context is used for launching.
#
# When specified, SkyPilot will fail over through the contexts in the same
# order as they are specified here.
# SkyPilot will try provisioning and fail over Kubernetes contexts in the same order
# as they are specified here. E.g., SkyPilot will try using context1 first.
# If it is out of resources or unreachable, it will fail over and try context2.
#
# If not specified, only the current active context is used for launching new clusters.

@Michaelvll Michaelvll marked this pull request as draft September 24, 2024 08:10
@Michaelvll (Collaborator, PR author):

We realized that we need to update the code that checks resource feasibility on a Kubernetes cluster to support different contexts and make failover fully functional. Changed this PR to a draft for now to fix that issue.
