[Docs] K8s Debugging Docs (#3128)
* debugging docs

* fixes

* update gpu instructions

* services debugging

* services debugging

* comments wip

* comments wip

* comments

* fixes
romilbhardwaj committed Mar 2, 2024
1 parent 095acab commit a63a56a
Showing 9 changed files with 418 additions and 15 deletions.
1 change: 1 addition & 0 deletions docs/source/reference/kubernetes/index.rst
@@ -255,3 +255,4 @@ Kubernetes support is under active development. Some features are in progress an
:hidden:

kubernetes-setup
kubernetes-troubleshooting
34 changes: 26 additions & 8 deletions docs/source/reference/kubernetes/kubernetes-setup.rst
@@ -165,12 +165,14 @@ such as `kubeadm <https://kubernetes.io/docs/setup/production-environment/tools/
`Rancher <https://ranchermanager.docs.rancher.com/v2.5/pages-for-subheaders/kubernetes-clusters-in-rancher-setup>`_.
Please follow their respective guides to deploy your Kubernetes cluster.

.. _kubernetes-setup-gpusupport:

Setting up GPU support
~~~~~~~~~~~~~~~~~~~~~~
If your Kubernetes cluster has Nvidia GPUs, ensure that:

1. The Nvidia GPU operator is installed (i.e., ``nvidia.com/gpu`` resource is available on each node) and ``nvidia`` is set as the default runtime for your container engine. See `Nvidia's installation guide <https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#install-nvidia-gpu-operator>`_ for more details.
2. Each node in your cluster is labelled with the GPU type. This labelling can be done by adding a label of the format ``skypilot.co/accelerators: <gpu_name>``, where the ``<gpu_name>`` is the lowercase name of the GPU. For example, a node with V100 GPUs must have a label :code:`skypilot.co/accelerators: v100`.
2. Each node in your cluster is labelled with the GPU type. This labelling can be done using `SkyPilot's GPU labelling script <automatic-gpu-labelling_>`_ or by manually adding a label of the format ``skypilot.co/accelerator: <gpu_name>``, where ``<gpu_name>`` is the lowercase name of the GPU. For example, a node with V100 GPUs must have the label :code:`skypilot.co/accelerator: v100`.

.. tip::
You can check if GPU operator is installed and the ``nvidia`` runtime is set as default by running:
@@ -188,7 +190,11 @@ If your Kubernetes cluster has Nvidia GPUs, ensure that:

.. note::

GPU labels are case-sensitive. Ensure that the GPU name is lowercase if you are using the ``skypilot.co/accelerators`` label.
GPU labels are case-sensitive. Ensure that the GPU name is lowercase if you are using the ``skypilot.co/accelerator`` label.

.. note::

GPU labelling is not required on GKE clusters - SkyPilot will automatically use GKE provided labels. However, you will still need to install `drivers <https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers>`_.

.. _automatic-gpu-labelling:

@@ -203,13 +209,9 @@ We provide a convenience script that automatically detects GPU types and labels
Created GPU labeler job for node ip-192-168-54-76.us-west-2.compute.internal
Created GPU labeler job for node ip-192-168-93-215.us-west-2.compute.internal
GPU labeling started - this may take a few minutes to complete.
GPU labeling started - this may take 10 min or more to complete.
To check the status of GPU labeling jobs, run `kubectl get jobs --namespace=kube-system -l job=sky-gpu-labeler`
You can check if nodes have been labeled by running `kubectl describe nodes` and looking for labels of the format `skypilot.co/accelerators: <gpu_name>`.
.. note::

GPU labelling is not required on GKE clusters - SkyPilot will automatically use GKE provided labels. However, you will still need to install `drivers <https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers>`_.
You can check if nodes have been labeled by running `kubectl describe nodes` and looking for labels of the format `skypilot.co/accelerator: <gpu_name>`.
.. note::

@@ -229,6 +231,14 @@ You can also check the GPUs available on your nodes by running:
$ sky show-gpus --cloud kubernetes
.. tip::

If automatic GPU labelling fails, you can manually label your nodes with the GPU type. Use the following command to label your nodes:

.. code-block:: console

   $ kubectl label nodes <node-name> skypilot.co/accelerator=<gpu_name>
.. _kubernetes-setup-onprem-distro-specific:

Notes for specific Kubernetes distributions
Expand Down Expand Up @@ -392,6 +402,10 @@ To use this mode:
   kubernetes:
     ports: ingress

.. tip::

For RKE2 and K3s, the pre-installed Nginx ingress is not correctly configured by default. Follow the `bare-metal installation instructions <https://kubernetes.github.io/ingress-nginx/deploy/#bare-metal-clusters>`_ to set up the Nginx ingress controller correctly.

When using this mode, SkyPilot creates an ingress resource and a ClusterIP service for each port opened. The port can be accessed externally by using the Ingress URL plus a path prefix of the form :code:`/skypilot/{pod_name}/{port}`.

Use :code:`sky status --endpoints <cluster>` to view the full endpoint URLs for all ports.
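To make the endpoint format concrete, here is a minimal Python sketch that constructs the external URL for an opened port in ingress mode, using the documented path prefix :code:`/skypilot/{pod_name}/{port}`. The base URL and pod name below are illustrative placeholders, not values SkyPilot itself produces.

```python
# Sketch: build the external URL for a port opened in Nginx ingress mode.
# The path prefix format /skypilot/{pod_name}/{port} comes from the docs above;
# the base URL and pod name are hypothetical examples.

def ingress_endpoint(base_url: str, pod_name: str, port: int) -> str:
    """Return the externally accessible URL for an opened port."""
    return f"{base_url.rstrip('/')}/skypilot/{pod_name}/{port}"

# Example usage with placeholder values:
print(ingress_endpoint("http://203.0.113.10", "mycluster-head", 8080))
# -> http://203.0.113.10/skypilot/mycluster-head/8080
```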
@@ -453,3 +467,7 @@ Note that this dashboard can only be accessed from the machine where the ``kubec
`Kubernetes documentation <https://kubernetes.io/docs/tasks/access-application-cluster/web-ui-dashboard/>`_
for more information on how to set up access control for the dashboard.

Troubleshooting Kubernetes Setup
--------------------------------

If you encounter issues while setting up your Kubernetes cluster, please refer to the :ref:`troubleshooting guide <kubernetes-troubleshooting>` to diagnose and fix issues.
298 changes: 298 additions & 0 deletions docs/source/reference/kubernetes/kubernetes-troubleshooting.rst
@@ -0,0 +1,298 @@
.. _kubernetes-troubleshooting:

Kubernetes Troubleshooting
==========================

If you're unable to run SkyPilot tasks on your Kubernetes cluster, this guide will help you debug common issues.

If this guide does not help resolve your issue, please reach out to us on `Slack <https://slack.skypilot.co>`_ or `GitHub <https://github.com/skypilot-org/skypilot>`_.

.. _kubernetes-troubleshooting-basic:

Verifying basic setup
---------------------

Step A0 - Is Kubectl functional?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Are you able to run :code:`kubectl get nodes` without any errors?

.. code-block:: bash

    $ kubectl get nodes
    # This should list all the nodes in your cluster.

Make sure at least one node is in the :code:`Ready` state.

If you see an error, ensure that your kubeconfig file at :code:`~/.kube/config` is correctly set up.

.. note::
The :code:`kubectl` command should not require any additional flags or environment variables to run.
If it requires additional flags, you must encode all configuration in your kubeconfig file at :code:`~/.kube/config`.
For example, :code:`--context`, :code:`--token`, :code:`--certificate-authority`, etc. should all be configured directly in the kubeconfig file.

Step A1 - Can you create pods and services?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As a sanity check, we will now try creating a simple pod running an HTTP server, along with a service, to verify that your cluster and its networking are functional.

We will use the SkyPilot default image :code:`us-central1-docker.pkg.dev/skypilot-375900/skypilotk8s/skypilot:latest` to verify that the image can be pulled from the registry.

.. code-block:: bash

    $ kubectl apply -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/cpu_test_pod.yaml
    # Verify that the pod is running by checking its status
    $ kubectl get pod skytest
    # Try accessing the HTTP server in the pod by port-forwarding it to your local machine
    $ kubectl port-forward svc/skytest-svc 8080:8080
    # Open a browser and navigate to http://localhost:8080 to see an index page
    # Once you have verified that the pod is running, you can delete it
    $ kubectl delete -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/cpu_test_pod.yaml

If your pod does not start, check the pod's logs for errors with :code:`kubectl describe pod skytest` and :code:`kubectl logs skytest`.

Step A2 - Can SkyPilot access your cluster?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Run :code:`sky check` to verify that SkyPilot can access your cluster.

.. code-block:: bash

    $ sky check
    # Should show `Kubernetes: Enabled`

If you see an error, ensure that your kubeconfig file at :code:`~/.kube/config` is correctly set up.


Step A3 - Can you launch a SkyPilot task?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Next, try running a simple hello world task to verify that SkyPilot can launch tasks on your cluster.

.. code-block:: bash

    $ sky launch -y -c mycluster --cloud kubernetes -- "echo hello world"
    # Task should run and print "hello world" to the console
    # Once you have verified that the task runs, you can delete the cluster
    $ sky down -y mycluster

If your task does not run, check the terminal and the provisioning logs for errors. The path to the provisioning logs is printed at the start of the SkyPilot output, in the line starting with "To view detailed progress: ...".

.. _kubernetes-troubleshooting-gpus:

Checking GPU support
--------------------

If you are trying to run a GPU task, make sure you have followed the instructions in :ref:`kubernetes-setup-gpusupport` to set up your cluster for GPU support.

In this section, we will verify that your cluster has GPU support and that SkyPilot can access it.

Step B0 - Is your cluster GPU-enabled?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Run :code:`kubectl describe nodes` or the below snippet to verify that your nodes have :code:`nvidia.com/gpu` resources.

.. code-block:: bash

    $ kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, capacity: .status.capacity}'
    # Look for the `nvidia.com/gpu` field under capacity in the output. It should show the number of GPUs available on each node.

If you do not see the :code:`nvidia.com/gpu` field, your cluster likely does not have the Nvidia GPU operator installed.
Please follow the instructions in :ref:`kubernetes-setup-gpusupport` to install the Nvidia GPU operator.
Note that GPU operator installation can take several minutes, and you may see 0 capacity for ``nvidia.com/gpu`` resources until the installation is complete.
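The jq check above can also be replicated in Python. The JSON below is a hypothetical, trimmed-down example of :code:`kubectl get nodes -o json` output, used only to illustrate what to look for; real output has many more fields.

```python
import json

# Sketch: inspect node capacity for nvidia.com/gpu, mirroring the jq command above.
# nodes_json is a made-up example of `kubectl get nodes -o json` output.
nodes_json = json.loads("""
{"items": [
  {"metadata": {"name": "node-1"},
   "status": {"capacity": {"cpu": "8", "memory": "32Gi", "nvidia.com/gpu": "4"}}},
  {"metadata": {"name": "node-2"},
   "status": {"capacity": {"cpu": "8", "memory": "32Gi"}}}
]}
""")

for node in nodes_json["items"]:
    # Missing key (or "0") means the GPU operator is absent or still installing.
    gpus = int(node["status"]["capacity"].get("nvidia.com/gpu", 0))
    print(f"{node['metadata']['name']}: {gpus} GPU(s)")
```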

.. tip::

If you are using GKE, refer to :ref:`kubernetes-setup-gke` to install the appropriate drivers.

Step B1 - Can you run a GPU pod?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Verify that the GPU operator is installed and the ``nvidia`` runtime is set as the default by launching a test GPU pod:

.. code-block:: bash

    $ kubectl apply -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/gpu_test_pod.yaml
    # Verify that the pod is running by checking its status
    $ kubectl get pod skygputest
    $ kubectl logs skygputest
    # Should print the nvidia-smi output to the console
    # Once you have verified that the pod is running, you can delete it
    $ kubectl delete -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/gpu_test_pod.yaml

If the pod status is pending, ensure that :code:`nvidia.com/gpu` resources are available on your nodes, as described in the previous step. You can debug further by running :code:`kubectl describe pod skygputest`.

If the logs show ``nvidia-smi: command not found``, the ``nvidia`` runtime is likely not set as the default. Please install the Nvidia GPU operator and make sure the ``nvidia`` runtime is set as the default.
For example, for RKE2, refer to instructions on `Nvidia GPU Operator installation with Helm on RKE2 <https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#custom-configuration-for-runtime-containerd>`_ to set the ``nvidia`` runtime as default.


Step B2 - Are your nodes labeled correctly?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

SkyPilot requires nodes to be labeled with the correct GPU type to run GPU tasks. Run :code:`kubectl get nodes -o json` to verify that your nodes are labeled correctly.

.. tip::

If you are using GKE, your nodes should be automatically labeled with :code:`cloud.google.com/gke-accelerator`. You can skip this step.

.. code-block:: bash

    $ kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, labels: .metadata.labels}'
    # Look for the `skypilot.co/accelerator` label in the output. It should show the GPU type for each node.

If you do not see the ``skypilot.co/accelerator`` label, your nodes are not labeled correctly. Please follow the instructions in :ref:`kubernetes-setup-gpusupport` to label your nodes.
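For illustration, the label check can be sketched in Python as well. The JSON below is a hypothetical excerpt of :code:`kubectl get nodes -o json` output, not real cluster data; it demonstrates both a correctly labeled node and a node missing the label.

```python
import json

# Sketch: verify SkyPilot GPU labels on nodes, mirroring the jq command above.
# nodes_json is a hypothetical `kubectl get nodes -o json` excerpt.
nodes_json = json.loads("""
{"items": [
  {"metadata": {"name": "gpu-node", "labels": {"skypilot.co/accelerator": "v100"}}},
  {"metadata": {"name": "cpu-node", "labels": {}}}
]}
""")

LABEL = "skypilot.co/accelerator"
for node in nodes_json["items"]:
    gpu_type = node["metadata"]["labels"].get(LABEL)
    if gpu_type is None:
        print(f"{node['metadata']['name']}: missing {LABEL} label")
    elif gpu_type != gpu_type.lower():
        # Labels are case-sensitive; the GPU name must be lowercase (e.g., v100).
        print(f"{node['metadata']['name']}: label must be lowercase, got {gpu_type}")
    else:
        print(f"{node['metadata']['name']}: {gpu_type}")
```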

Step B3 - Can SkyPilot see your GPUs?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Run :code:`sky check` to verify that SkyPilot can see your GPUs.

.. code-block:: bash

    $ sky check
    # Should show `Kubernetes: Enabled` and should not print any warnings about GPU support.

    # List the available GPUs in your cluster
    $ sky show-gpus --cloud kubernetes

Step B4 - Try launching a dummy GPU task
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Next, try running a simple GPU task to verify that SkyPilot can launch GPU tasks on your cluster.

.. code-block:: bash

    # Replace <gpu-type> with a GPU type from the sky show-gpus output
    $ sky launch -y -c mygpucluster --cloud kubernetes --gpus <gpu-type>:1 -- "nvidia-smi"
    # Task should run and print the nvidia-smi output to the console
    # Once you have verified that the task runs, you can delete the cluster
    $ sky down -y mygpucluster

If your task does not run, check the terminal and the provisioning logs for errors. The path to the provisioning logs is printed at the start of the SkyPilot output, in the line starting with "To view detailed progress: ...".

.. _kubernetes-troubleshooting-ports:

Verifying ports support
-----------------------

If you are trying to run a task that requires ports to be opened, make sure you have followed the instructions in :ref:`kubernetes-ports`
to configure SkyPilot and your cluster to use the desired method (LoadBalancer service or Nginx Ingress) for ports support.

In this section, we will verify that your cluster supports ports and that services launched by SkyPilot can be accessed.

.. _kubernetes-troubleshooting-ports-loadbalancer:

Step C0 - Verifying LoadBalancer service setup
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you are using LoadBalancer services for ports support, follow the below steps to verify that your cluster is configured correctly.

.. tip::

If you are using Nginx Ingress for ports support, skip to :ref:`kubernetes-troubleshooting-ports-nginx`.

Does your cluster support LoadBalancer services?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To verify that your cluster supports LoadBalancer services, we will create an example service and verify that it gets an external IP.

.. code-block:: bash

    $ kubectl apply -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/cpu_test_pod.yaml
    $ kubectl apply -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/loadbalancer_test_svc.yaml
    # Verify that the service gets an external IP
    # Note: On cloud providers, it may take some time for the status to change from pending to an external IP
    $ watch kubectl get svc skytest-loadbalancer
    # Once you get an IP, try accessing the HTTP server by curling the external IP
    $ IP=$(kubectl get svc skytest-loadbalancer -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    $ curl $IP:8080
    # Once you have verified that the service is accessible, you can delete it
    $ kubectl delete -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/cpu_test_pod.yaml
    $ kubectl delete -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/loadbalancer_test_svc.yaml

If your service does not get an external IP, check the service's status with :code:`kubectl describe svc skytest-loadbalancer`. Your cluster may not support LoadBalancer services.


.. _kubernetes-troubleshooting-ports-nginx:

Step C0 - Verifying Nginx Ingress setup
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you are using Nginx Ingress for ports support, refer to :ref:`kubernetes-ingress` for instructions on how to install and configure Nginx Ingress.

.. tip::

If you are using LoadBalancer services for ports support, you can skip this section.

Does your cluster support Nginx Ingress?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To verify that your cluster supports Nginx Ingress, we will create an example ingress.

.. code-block:: bash

    $ kubectl apply -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/cpu_test_pod.yaml
    $ kubectl apply -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/ingress_test.yaml
    # Get the external IP of the ingress using the externalIPs field or the loadBalancer field
    $ IP=$(kubectl get service ingress-nginx-controller -n ingress-nginx -o jsonpath='{.spec.externalIPs[*]}') && [ -z "$IP" ] && IP=$(kubectl get service ingress-nginx-controller -n ingress-nginx -o jsonpath='{.status.loadBalancer.ingress[*].ip}')
    $ echo "Got IP: $IP"
    $ curl http://$IP/skytest
    # Once you have verified that the ingress is accessible, you can delete it
    $ kubectl delete -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/cpu_test_pod.yaml
    $ kubectl delete -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/ingress_test.yaml

If no IP is assigned, check the ingress controller service's status with :code:`kubectl describe svc ingress-nginx-controller -n ingress-nginx`.
Your ingress controller's service must be of type :code:`LoadBalancer` or :code:`NodePort` and must have an external IP.
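The externalIPs-then-loadBalancer fallback used in the shell one-liner above can be sketched in Python for clarity. The service dict below is a hypothetical, trimmed :code:`kubectl get service ... -o json` result, not real output.

```python
from typing import Optional

# Sketch: prefer spec.externalIPs, then fall back to status.loadBalancer.ingress,
# mirroring the shell one-liner above. svc is a hypothetical service dict.

def ingress_external_ip(svc: dict) -> Optional[str]:
    """Return the first external IP of an ingress controller service, if any."""
    external_ips = svc.get("spec", {}).get("externalIPs", [])
    if external_ips:
        return external_ips[0]
    for ing in svc.get("status", {}).get("loadBalancer", {}).get("ingress", []):
        if "ip" in ing:
            return ing["ip"]
    return None  # No external IP: service may not be LoadBalancer/NodePort-backed

# Example with a placeholder LoadBalancer-assigned IP:
svc = {"spec": {}, "status": {"loadBalancer": {"ingress": [{"ip": "203.0.113.7"}]}}}
print(ingress_external_ip(svc))
# -> 203.0.113.7
```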

Is SkyPilot configured to use Nginx Ingress?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Take a look at your :code:`~/.sky/config.yaml` file to verify that the :code:`ports: ingress` section is configured correctly.

.. code-block:: bash

    $ cat ~/.sky/config.yaml
    # Output should contain:
    #
    # kubernetes:
    #   ports: ingress

If not, add the :code:`ports: ingress` section to your :code:`~/.sky/config.yaml` file.

.. _kubernetes-troubleshooting-ports-dryrun:

Step C1 - Verifying SkyPilot can launch services
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Next, try running a simple task with a service to verify that SkyPilot can launch services on your cluster.

.. code-block:: bash

    $ sky launch -y -c myserver --cloud kubernetes --ports 8080 -- "python -m http.server 8080"
    # Obtain the endpoint of the service
    $ sky status --endpoint 8080 myserver
    # Try curling the endpoint to verify that the service is accessible
    $ curl <endpoint>

If you are unable to get the endpoint from SkyPilot,
run :code:`kubectl describe services` or :code:`kubectl describe ingress` to debug further.
2 changes: 1 addition & 1 deletion sky/provision/kubernetes/utils.py
@@ -32,7 +32,7 @@
}
NO_GPU_ERROR_MESSAGE = 'No GPUs found in Kubernetes cluster. \
If your cluster contains GPUs, make sure nvidia.com/gpu resource is available on the nodes and the node labels for identifying GPUs \
(e.g., skypilot.co/accelerators) are setup correctly. \
(e.g., skypilot.co/accelerator) are setup correctly. \
To further debug, run: sky check.'

# TODO(romilb): Add links to docs for configuration instructions when ready.
