
[k8s] Enable multiple kubernetes contexts for failover #3968

Draft
wants to merge 7 commits into master

Conversation

@Michaelvll Michaelvll (Collaborator) commented Sep 22, 2024

This allows users to specify the following in ~/.sky/config.yaml to enable SkyPilot to fail over across different Kubernetes contexts.

kubernetes:
  allowed_contexts:
    - kind-skypilot
    - gke_skypilot-xxx_us-central1-c_test-zhwu
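
Side note (not part of this PR): the context names to list under allowed_contexts are the same names that kubectl reports, e.g. via:

kubectl config get-contexts -o name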

TODO:

  • Check if we should improve the UX output for region vs context

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch -c test --cloud kubernetes --cpus 4 echo hi with two k8s clusters, one whose nodes have fewer than 4 CPUs and one whose nodes have more than 4 CPUs; SkyPilot correctly fails over from the first k8s cluster to the second one
    • Removed the larger k8s cluster's context name from allowed_contexts, then ran sky exec / sky launch again on the existing SkyPilot cluster
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@romilbhardwaj romilbhardwaj left a comment

Thanks @Michaelvll! Took a quick look

Comment on lines 119 to 120
if allowed_contexts is None:
return cls._regions
@romilbhardwaj (Collaborator):

If we are changing the semantics in this PR to have region for Kubernetes indicate the context, should we change this to return the current active context?

I.e., the region for Kubernetes now is always the name of a context. If config.yaml does not contain allowed_contexts, we use the current active context name instead of the hardcoded _SINGLETON_REGION. wdyt?
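
As a rough sketch (not code from this PR), the fallback could look like the following, using the kubernetes Python client; the helper name and the clouds.Region usage are assumptions:

from typing import Optional

from kubernetes import config as kube_config


def _active_context_name() -> Optional[str]:
    """Return the active kubeconfig context name, or None if no kubeconfig is available."""
    try:
        _, active_context = kube_config.list_kube_config_contexts()
        return active_context['name']
    except Exception:  # e.g. kubeconfig missing or malformed
        return None


# Inside Kubernetes.regions() (sketch):
#   allowed_contexts = skypilot_config.get_nested(
#       ('kubernetes', 'allowed_contexts'), None)
#   if allowed_contexts is None:
#       active = _active_context_name()
#       allowed_contexts = [active] if active is not None else []
#   return [clouds.Region(context_name) for context_name in allowed_contexts]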

@romilbhardwaj (Collaborator):

Also noting that this may require some backward compatibility testing, since the cluster region would change.

@Michaelvll (Collaborator, PR author):

I am thinking of keeping the same semantics as the original when allowed_contexts is not specified. That way, users with a single k8s cluster will not need to understand another concept. Wdyt?

@romilbhardwaj (Collaborator):

I think there are three main benefits of changing region to use the context name:

  1. Keeps our YAMLs/CLI consistent irrespective of whether allowed_contexts is used. For example, say a user wrote a YAML like this when they were using allowed_contexts: [dev, staging, prod]:
     resources:
       cloud: kubernetes
       region: dev
    Now say they removed allowed_contexts or shared this with a colleague without allowed_contexts set. They would need to update the YAML to region: kubernetes or remove the region flag to make this work. Also, they would lose the "run this YAML only on dev cluster" directive from the YAML.
  2. Helps [k8s] Show which Kubernetes context/cluster is used in sky status #3461. This is a common problem if the user switches contexts (even though they may use one at a time).
  3. Showing the context name under region during sky launch helps reaffirm a) the cluster your job will run on and b) SkyPilot can switch between contexts now.

Even for single k8s users, using a concrete region name instead of kubernetes might be easier to understand. wdyt?
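
For illustration only (assuming the proposed semantics where the context name is the region), a user could then pin a launch to one context with the existing --region flag; the cluster and task names here are placeholders:

sky launch --cloud kubernetes --region dev -c dev-cluster task.yaml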

@Michaelvll (Collaborator, PR author):

That is a good point! I have changed the region to always show the context.

docs/source/reference/config.rst (comment resolved)
allowed_contexts = skypilot_config.get_nested(
('kubernetes', 'allowed_contexts'), None)
if allowed_contexts is None:
return cls._regions
@romilbhardwaj (Collaborator):

[commentary, no action required] I am liking the idea of using regions (instead of clouds) to do multi-kubernetes. In the future, if we want to enable multi-k8s out of the box, we can simply return all contexts here :)
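
A rough sketch of that future direction (not part of this PR), again using the kubernetes Python client:

from typing import List

from kubernetes import config as kube_config


def _all_context_names() -> List[str]:
    """Return every context name defined in the local kubeconfig."""
    contexts, _ = kube_config.list_kube_config_contexts()
    return [c['name'] for c in contexts]

# Multi-k8s out of the box could then amount to treating each of these names
# as a region, instead of only the ones listed in allowed_contexts.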

@Michaelvll Michaelvll (Collaborator, PR author), Sep 23, 2024:

Conceptually, I find it clearer to have the following mapping:
k8s contexts -> local cloud config profiles.
Both of them contain:

  1. the identity to use for accessing the resource pool (k8s: user + namespace; cloud config: account)
  2. the resource pool to look at (k8s: cluster; cloud config: project to use)

I think the current way is a simple workaround for now, but we may need a better design in the future. The main confusion with using region may come from the fact that multiple contexts can map to the same k8s cluster with different namespaces or users.
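
For reference, a kubeconfig context entry bundles exactly these two pieces (identity and resource pool); an illustrative entry, reusing the context name from the example above:

contexts:
- name: gke_skypilot-xxx_us-central1-c_test-zhwu
  context:
    cluster: gke_skypilot-xxx_us-central1-c_test-zhwu  # the resource pool
    user: gke_skypilot-xxx_us-central1-c_test-zhwu     # the identity
    namespace: default                                 # optional namespace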

@romilbhardwaj (Collaborator):

Agreed, we probably need better solutions. I just realized many properties in the config may need to be updated in the near future to work well for multi-cluster (e.g., some contexts may need ports: ingress, while others may need ports: loadbalancer; same for other fields).
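
A purely hypothetical sketch of what per-context overrides could look like (this structure does not exist in the config today; the context_configs field name is made up for illustration):

kubernetes:
  allowed_contexts:
    - context1
    - context2
  # Hypothetical per-context section, not supported by this PR:
  context_configs:
    context1:
      ports: ingress
    context2:
      ports: loadbalancer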

@Michaelvll (Collaborator, PR author):

Yes, while updating the code to always show the context as the region, I also realized that there are more places to update, especially the failover code in Kubernetes._get_feasible_launchable_resources. If we have two clusters with different resource sets, our failover will likely disregard all the Kubernetes clusters if the cluster without the resources is the current active context.

Marking this PR as a draft for now to fix this issue.
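
A rough sketch (not the PR's implementation) of per-context feasibility checking, which would avoid discarding Kubernetes entirely just because the active context cannot fit the request; the CPU-only check and helper names are simplifications/assumptions:

from typing import List

from kubernetes import client as kube_client
from kubernetes import config as kube_config


def _parse_cpu(quantity: str) -> float:
    """Parse a Kubernetes CPU quantity, e.g. '4' or '3920m'."""
    return float(quantity[:-1]) / 1000 if quantity.endswith('m') else float(quantity)


def contexts_that_fit(allowed_contexts: List[str], cpus_needed: float) -> List[str]:
    """Return the contexts (in failover order) that have at least one node whose
    allocatable CPU can fit the request. Simplified: ignores memory, GPUs, and
    pods already running on the nodes."""
    feasible = []
    for context in allowed_contexts:
        api = kube_client.CoreV1Api(
            api_client=kube_config.new_client_from_config(context=context))
        nodes = api.list_node().items
        if any(_parse_cpu(node.status.allocatable['cpu']) >= cpus_needed
               for node in nodes):
            feasible.append(context)
    return feasible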

@romilbhardwaj romilbhardwaj (Collaborator) left a comment

Thanks @Michaelvll! Tested it and works nicely. Left some comments.

sky/clouds/kubernetes.py (outdated; comment resolved)
Comment on lines +489 to +492
# If not specified, only the current active context is used for launching.
#
# When specified, SkyPilot will fail over through the contexts in the same
# order as they are specified here.
@romilbhardwaj (Collaborator):

Some rewording and example:

Suggested change
# If not specified, only the current active context is used for launching.
#
# When specified, SkyPilot will fail over through the contexts in the same
# order as they are specified here.
# SkyPilot will try provisioning and fail over Kubernetes contexts in the same order
# as they are specified here. E.g., SkyPilot will try using context1 first.
# If it is out of resources or unreachable, it will fail over and try context2.
#
# If not specified, only the current active context is used for launching new clusters.

@Michaelvll Michaelvll marked this pull request as draft September 24, 2024 08:10
@Michaelvll (Collaborator, PR author):

We realized that we need to update the code that checks resource feasibility on a Kubernetes cluster to support different contexts and make failover fully functional. Changed this PR to a draft for now to fix that issue.
