[k8s] Enable multiple kubernetes contexts for failover #3968
base: master
Conversation
Thanks @Michaelvll! Took a quick look
sky/clouds/kubernetes.py
Outdated
```python
if allowed_contexts is None:
    return cls._regions
```
If we are changing the semantics in this PR so that `region` for Kubernetes indicates the context, should we change this to return the current active context? I.e., the region for Kubernetes is now always the name of a context. If config.yaml does not contain `allowed_contexts`, we use the current active context name instead of the hardcoded `_SINGLETON_REGION`. wdyt?
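The suggestion above can be sketched as follows. This is illustrative only, not SkyPilot's actual implementation; the `kubeconfig` dict stands in for a parsed `~/.kube/config`:

```python
# Illustrative sketch: derive the Kubernetes "region" list from the
# kubeconfig's current-context instead of a hardcoded singleton name.
from typing import List, Optional

_SINGLETON_REGION = 'kubernetes'  # the old hardcoded region name


def regions_for(kubeconfig: dict,
                allowed_contexts: Optional[List[str]]) -> List[str]:
    """Regions = allowed contexts, else the active context's name."""
    if allowed_contexts is not None:
        return list(allowed_contexts)
    # Fall back to the current active context, keeping the old
    # singleton name only when no context is active at all.
    active = kubeconfig.get('current-context')
    return [active] if active else [_SINGLETON_REGION]
```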
Also noting that this may require some backward compatibility testing, since the cluster region would change.
I am thinking of keeping the same semantics as the original when `allowed_contexts` is not specified. That way, single-cluster k8s users will not need to understand another concept. Wdyt?
I think there are three main benefits of switching region to use the context name:

- Keeps our YAMLs/CLI consistent irrespective of whether `allowed_contexts` is used. For example, say a user wrote a YAML like this when they were using `allowed_contexts: [dev, staging, prod]`:

  ```yaml
  resources:
    cloud: kubernetes
    region: dev
  ```

  Now say they removed `allowed_contexts`, or shared this YAML with a colleague who does not have `allowed_contexts` set. They would need to update the YAML to `region: kubernetes` or remove the region field to make it work, and they would lose the "run this YAML only on the dev cluster" directive from the YAML.
- Helps with "[k8s] Show which Kubernetes context/cluster is used in `sky status`" #3461. This is a common problem if the user switches contexts (even though they may use one at a time).
- Showing the context name under region during `sky launch` helps reaffirm a) which cluster your job will run on, and b) that SkyPilot can now switch between contexts.

Even for single k8s users, using a concrete region name instead of `kubernetes` might be easier to understand. wdyt?
That is a good point! I have changed the region to always show the context.
sky/clouds/kubernetes.py
Outdated
```python
allowed_contexts = skypilot_config.get_nested(
    ('kubernetes', 'allowed_contexts'), None)
if allowed_contexts is None:
    return cls._regions
```
[commentary, no action required] I am liking the idea of using regions (instead of clouds) to do multi-kubernetes. In the future, if we want to enable multi-k8s out of the box, we can simply return all contexts here :)
Conceptually, I found it clearer to have the following mapping:

k8s contexts -> local cloud config profiles.

Because both of them contain:
- the identity to use for accessing the resource pool (k8s: user + namespace; cloud config: account)
- the resource pool to look at (k8s: cluster; cloud config: project to use)

I think the current way is a simple workaround for now, but we may need a better design in the future. The main confusion with using region may come from this: multiple contexts can map to the same k8s cluster with a different namespace or user.
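The confusion described above can be made concrete with a small sketch (not SkyPilot code; the `contexts` list stands in for the `contexts` section of a parsed kubeconfig, with made-up cluster/user names):

```python
# Group kubeconfig contexts by the cluster they point at, to show that
# several contexts -- and hence several "regions" -- can map to a single
# physical cluster, differing only in namespace or user.
from collections import defaultdict

contexts = [
    {'name': 'dev', 'context': {'cluster': 'gke-main', 'user': 'alice',
                                'namespace': 'dev'}},
    {'name': 'prod', 'context': {'cluster': 'gke-main', 'user': 'alice',
                                 'namespace': 'prod'}},
    {'name': 'staging', 'context': {'cluster': 'gke-west', 'user': 'bob',
                                    'namespace': 'default'}},
]


def contexts_by_cluster(contexts):
    """Map each cluster name to the context names that target it."""
    groups = defaultdict(list)
    for ctx in contexts:
        groups[ctx['context']['cluster']].append(ctx['name'])
    return dict(groups)
```

Here `dev` and `prod` are distinct "regions" backed by the same cluster.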
Agreed, we probably need better solutions. I just realized that many properties in the config may need to be updated in the near future to work well for multi-cluster (e.g., some contexts may need `ports: ingress`, while others may need `ports: loadbalancer`. Same for other fields).
Yes, while updating the code to always show the context as the region, I also realized that there are more places to be updated, especially the failover code in `Kubernetes._get_feasible_launchable_resources`. If we have two clusters with different resource sets, our failover will likely disregard all the Kubernetes clusters if the cluster without the resources is the current active context.
Marking this PR as draft for now to fix this issue.
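A hypothetical sketch of the fix direction described above (the function and the `context_cpus` probing data are illustrative, not SkyPilot's actual API): instead of rejecting Kubernetes when the active context lacks the resources, check feasibility per allowed context and keep any that fit.

```python
# Illustrative sketch: per-context feasibility check for failover.
# `context_cpus` stands in for real per-cluster resource probing.


def feasible_contexts(allowed_contexts, context_cpus, requested_cpus):
    """Return the contexts, in failover order, that can fit the request."""
    return [ctx for ctx in allowed_contexts
            if context_cpus.get(ctx, 0) >= requested_cpus]
```

With this, a request for 4 CPUs fails over past a context whose nodes only have 2 CPUs instead of disqualifying Kubernetes entirely.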
Thanks @Michaelvll! Tested it and works nicely. Left some comments.
```yaml
# If not specified, only the current active context is used for launching.
#
# When specified, SkyPilot will fail over through the contexts in the same
# order as they are specified here.
```
Some rewording and example:
```yaml
# SkyPilot will try provisioning and fail over Kubernetes contexts in the same order
# as they are specified here. E.g., SkyPilot will try using context1 first.
# If it is out of resources or unreachable, it will fail over and try context2.
#
# If not specified, only the current active context is used for launching new clusters.
```
We realized that we need to update the code for checking resource feasibility on a Kubernetes cluster to support different contexts and make failover fully functional. Changed this PR to draft for now to fix that issue.
This allows the user to specify the following in `~/.sky/config.yaml` to enable SkyPilot to fail over through different Kubernetes contexts.

TODO:
- `region` vs `context`
Tested (run the relevant ones):

- `bash format.sh`
- `sky launch -c test --cloud kubernetes --cpus 4 echo hi` with two k8s clusters in `allowed_contexts`, one with nodes having fewer than 4 CPUs and one with nodes having more than 4 CPUs; it correctly fails over from the first k8s cluster to the second one, and `sky exec`/`sky launch` again on the existing SkyPilot cluster works
- `pytest tests/test_smoke.py`
- `pytest tests/test_smoke.py::test_fill_in_the_name`
- `conda deactivate; bash -i tests/backward_compatibility_tests.sh`