[Docs] K8s Debugging Docs (#3128)
* debugging docs

* fixes

* update gpu instructions

* services debugging

* services debugging

* comments wip

* comments wip

* comments

* fixes
romilbhardwaj committed Mar 2, 2024
1 parent 095acab commit a63a56a
Showing 9 changed files with 418 additions and 15 deletions.
1 change: 1 addition & 0 deletions docs/source/reference/kubernetes/index.rst
@@ -255,3 +255,4 @@ Kubernetes support is under active development. Some features are in progress an
:hidden:

kubernetes-setup
kubernetes-troubleshooting
34 changes: 26 additions & 8 deletions docs/source/reference/kubernetes/kubernetes-setup.rst
@@ -165,12 +165,14 @@ such as `kubeadm <https://kubernetes.io/docs/setup/production-environment/tools/
`Rancher <https://ranchermanager.docs.rancher.com/v2.5/pages-for-subheaders/kubernetes-clusters-in-rancher-setup>`_.
Please follow their respective guides to deploy your Kubernetes cluster.

.. _kubernetes-setup-gpusupport:

Setting up GPU support
~~~~~~~~~~~~~~~~~~~~~~
If your Kubernetes cluster has Nvidia GPUs, ensure that:

1. The Nvidia GPU operator is installed (i.e., ``nvidia.com/gpu`` resource is available on each node) and ``nvidia`` is set as the default runtime for your container engine. See `Nvidia's installation guide <https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#install-nvidia-gpu-operator>`_ for more details.
2. Each node in your cluster is labelled with the GPU type. This labelling can be done by adding a label of the format ``skypilot.co/accelerators: <gpu_name>``, where the ``<gpu_name>`` is the lowercase name of the GPU. For example, a node with V100 GPUs must have a label :code:`skypilot.co/accelerators: v100`.
2. Each node in your cluster is labelled with the GPU type. This labelling can be done using `SkyPilot's GPU labelling script <automatic-gpu-labelling_>`_ or by manually adding a label of the format ``skypilot.co/accelerator: <gpu_name>``, where ``<gpu_name>`` is the lowercase name of the GPU. For example, a node with V100 GPUs must have the label :code:`skypilot.co/accelerator: v100`.

.. tip::
You can check if GPU operator is installed and the ``nvidia`` runtime is set as default by running:
@@ -188,7 +190,11 @@ If your Kubernetes cluster has Nvidia GPUs, ensure that:

.. note::

GPU labels are case-sensitive. Ensure that the GPU name is lowercase if you are using the ``skypilot.co/accelerators`` label.
GPU labels are case-sensitive. Ensure that the GPU name is lowercase if you are using the ``skypilot.co/accelerator`` label.

.. note::

GPU labelling is not required on GKE clusters - SkyPilot will automatically use GKE provided labels. However, you will still need to install `drivers <https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers>`_.

.. _automatic-gpu-labelling:

@@ -203,13 +209,9 @@ We provide a convenience script that automatically detects GPU types and labels
Created GPU labeler job for node ip-192-168-54-76.us-west-2.compute.internal
Created GPU labeler job for node ip-192-168-93-215.us-west-2.compute.internal
GPU labeling started - this may take a few minutes to complete.
GPU labeling started - this may take 10 min or more to complete.
To check the status of GPU labeling jobs, run `kubectl get jobs --namespace=kube-system -l job=sky-gpu-labeler`
You can check if nodes have been labeled by running `kubectl describe nodes` and looking for labels of the format `skypilot.co/accelerators: <gpu_name>`.
.. note::

GPU labelling is not required on GKE clusters - SkyPilot will automatically use GKE provided labels. However, you will still need to install `drivers <https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers>`_.
You can check if nodes have been labeled by running `kubectl describe nodes` and looking for labels of the format `skypilot.co/accelerator: <gpu_name>`.
.. note::

@@ -229,6 +231,14 @@ You can also check the GPUs available on your nodes by running:
$ sky show-gpus --cloud kubernetes
.. tip::

If automatic GPU labelling fails, you can manually label your nodes with the GPU type. Use the following command to label your nodes:

.. code-block:: console

   $ kubectl label nodes <node-name> skypilot.co/accelerator=<gpu_name>
.. _kubernetes-setup-onprem-distro-specific:

Notes for specific Kubernetes distributions
Expand Down Expand Up @@ -392,6 +402,10 @@ To use this mode:
   kubernetes:
     ports: ingress

.. tip::

For RKE2 and K3s, the pre-installed Nginx ingress is not correctly configured by default. Follow the `bare-metal installation instructions <https://kubernetes.github.io/ingress-nginx/deploy/#bare-metal-clusters>`_ to set up the Nginx ingress controller correctly.

When using this mode, SkyPilot creates an ingress resource and a ClusterIP service for each port opened. The port can be accessed externally by using the Ingress URL plus a path prefix of the form :code:`/skypilot/{pod_name}/{port}`.

Use :code:`sky status --endpoints <cluster>` to view the full endpoint URLs for all ports.
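To make the endpoint format concrete, here is a minimal Python sketch that constructs the external URL for an opened port in ingress mode, using the documented path prefix :code:`/skypilot/{pod_name}/{port}`. The base URL and pod name below are illustrative placeholders, not values SkyPilot itself produces.

```python
# Sketch: build the external URL for a port opened in Nginx ingress mode.
# The path prefix format /skypilot/{pod_name}/{port} comes from the docs above;
# the base URL and pod name are hypothetical examples.

def ingress_endpoint(base_url: str, pod_name: str, port: int) -> str:
    """Return the externally accessible URL for an opened port."""
    return f"{base_url.rstrip('/')}/skypilot/{pod_name}/{port}"

# Example usage with placeholder values:
print(ingress_endpoint("http://203.0.113.10", "mycluster-head", 8080))
# -> http://203.0.113.10/skypilot/mycluster-head/8080
```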
@@ -453,3 +467,7 @@ Note that this dashboard can only be accessed from the machine where the ``kubec
`Kubernetes documentation <https://kubernetes.io/docs/tasks/access-application-cluster/web-ui-dashboard/>`_
for more information on how to set up access control for the dashboard.

Troubleshooting Kubernetes Setup
--------------------------------

If you encounter issues while setting up your Kubernetes cluster, please refer to the :ref:`troubleshooting guide <kubernetes-troubleshooting>` to diagnose and fix issues.
298 changes: 298 additions & 0 deletions docs/source/reference/kubernetes/kubernetes-troubleshooting.rst
@@ -0,0 +1,298 @@
.. _kubernetes-troubleshooting:

Kubernetes Troubleshooting
==========================

If you're unable to run SkyPilot tasks on your Kubernetes cluster, this guide will help you debug common issues.

If this guide does not help resolve your issue, please reach out to us on `Slack <https://slack.skypilot.co>`_ or `GitHub <https://github.com/skypilot-org/skypilot>`_.

.. _kubernetes-troubleshooting-basic:

Verifying basic setup
---------------------

Step A0 - Is Kubectl functional?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Are you able to run :code:`kubectl get nodes` without any errors?

.. code-block:: bash

    $ kubectl get nodes
    # This should list all the nodes in your cluster.

Make sure at least one node is in the :code:`Ready` state.

If you see an error, ensure that your kubeconfig file at :code:`~/.kube/config` is correctly set up.

.. note::
The :code:`kubectl` command should not require any additional flags or environment variables to run.
If it requires additional flags, you must encode all configuration in your kubeconfig file at :code:`~/.kube/config`.
For example, :code:`--context`, :code:`--token`, :code:`--certificate-authority`, etc. should all be configured directly in the kubeconfig file.

Step A1 - Can you create pods and services?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As a sanity check, we will now try creating a simple pod running an HTTP server, along with a service, to verify that your cluster and its networking are functional.

We will use the SkyPilot default image :code:`us-central1-docker.pkg.dev/skypilot-375900/skypilotk8s/skypilot:latest` to verify that the image can be pulled from the registry.

.. code-block:: bash

    $ kubectl apply -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/cpu_test_pod.yaml
    # Verify that the pod is running by checking its status
    $ kubectl get pod skytest
    # Try accessing the HTTP server in the pod by port-forwarding it to your local machine
    $ kubectl port-forward svc/skytest-svc 8080:8080
    # Open a browser and navigate to http://localhost:8080 to see an index page
    # Once you have verified that the pod is running, you can delete it
    $ kubectl delete -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/cpu_test_pod.yaml

If your pod does not start, check the pod's logs for errors with :code:`kubectl describe pod skytest` and :code:`kubectl logs skytest`.

Step A2 - Can SkyPilot access your cluster?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Run :code:`sky check` to verify that SkyPilot can access your cluster.

.. code-block:: bash

    $ sky check
    # Should show `Kubernetes: Enabled`

If you see an error, ensure that your kubeconfig file at :code:`~/.kube/config` is correctly set up.


Step A3 - Can you launch a SkyPilot task?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Next, try running a simple hello world task to verify that SkyPilot can launch tasks on your cluster.

.. code-block:: bash

    $ sky launch -y -c mycluster --cloud kubernetes -- "echo hello world"
    # Task should run and print "hello world" to the console
    # Once you have verified that the task runs, you can delete the cluster
    $ sky down -y mycluster

If your task does not run, check the terminal and the provisioning logs for errors. The path to the provisioning logs is printed at the start of the SkyPilot output, in the line starting with "To view detailed progress: ...".

.. _kubernetes-troubleshooting-gpus:

Checking GPU support
--------------------

If you are trying to run a GPU task, make sure you have followed the instructions in :ref:`kubernetes-setup-gpusupport` to set up your cluster for GPU support.

In this section, we will verify that your cluster has GPU support and that SkyPilot can access it.

Step B0 - Is your cluster GPU-enabled?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Run :code:`kubectl describe nodes` or the below snippet to verify that your nodes have :code:`nvidia.com/gpu` resources.

.. code-block:: bash

    $ kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, capacity: .status.capacity}'
    # Look for the `nvidia.com/gpu` field under capacity in the output. It should show the number of GPUs available on each node.

If you do not see the :code:`nvidia.com/gpu` field, your cluster likely does not have the Nvidia GPU operator installed.
Please follow the instructions in :ref:`kubernetes-setup-gpusupport` to install the Nvidia GPU operator.
Note that GPU operator installation can take several minutes, and you may see 0 capacity for ``nvidia.com/gpu`` resources until the installation is complete.
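The jq check above can also be replicated in Python. The JSON below is a hypothetical, trimmed-down example of :code:`kubectl get nodes -o json` output, used only to illustrate what to look for; real output has many more fields.

```python
import json

# Sketch: inspect node capacity for nvidia.com/gpu, mirroring the jq command above.
# nodes_json is a made-up example of `kubectl get nodes -o json` output.
nodes_json = json.loads("""
{"items": [
  {"metadata": {"name": "node-1"},
   "status": {"capacity": {"cpu": "8", "memory": "32Gi", "nvidia.com/gpu": "4"}}},
  {"metadata": {"name": "node-2"},
   "status": {"capacity": {"cpu": "8", "memory": "32Gi"}}}
]}
""")

for node in nodes_json["items"]:
    # Missing key (or "0") means the GPU operator is absent or still installing.
    gpus = int(node["status"]["capacity"].get("nvidia.com/gpu", 0))
    print(f"{node['metadata']['name']}: {gpus} GPU(s)")
```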

.. tip::

If you are using GKE, refer to :ref:`kubernetes-setup-gke` to install the appropriate drivers.

Step B1 - Can you run a GPU pod?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Verify that the GPU operator is installed and the ``nvidia`` runtime is set as the default by launching a test GPU pod:

.. code-block:: bash

    $ kubectl apply -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/gpu_test_pod.yaml
    # Verify that the pod is running by checking its status
    $ kubectl get pod skygputest
    $ kubectl logs skygputest
    # Should print the nvidia-smi output to the console
    # Once you have verified that the pod is running, you can delete it
    $ kubectl delete -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/gpu_test_pod.yaml

If the pod status is pending, ensure that :code:`nvidia.com/gpu` resources are available on your nodes, as described in the previous step. You can debug further by running :code:`kubectl describe pod skygputest`.

If the logs show ``nvidia-smi: command not found``, the ``nvidia`` runtime is likely not set as the default. Please install the Nvidia GPU operator and make sure the ``nvidia`` runtime is set as the default.
For example, for RKE2, refer to instructions on `Nvidia GPU Operator installation with Helm on RKE2 <https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#custom-configuration-for-runtime-containerd>`_ to set the ``nvidia`` runtime as default.


Step B2 - Are your nodes labeled correctly?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

SkyPilot requires nodes to be labeled with the correct GPU type to run GPU tasks. Run :code:`kubectl get nodes -o json` to verify that your nodes are labeled correctly.

.. tip::

If you are using GKE, your nodes should be automatically labeled with :code:`cloud.google.com/gke-accelerator`. You can skip this step.

.. code-block:: bash

    $ kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, labels: .metadata.labels}'
    # Look for the `skypilot.co/accelerator` label in the output. It should show the GPU type for each node.

If you do not see the ``skypilot.co/accelerator`` label, your nodes are not labeled correctly. Please follow the instructions in :ref:`kubernetes-setup-gpusupport` to label your nodes.
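For illustration, the label check can be sketched in Python as well. The JSON below is a hypothetical excerpt of :code:`kubectl get nodes -o json` output, not real cluster data; it demonstrates both a correctly labeled node and a node missing the label.

```python
import json

# Sketch: verify SkyPilot GPU labels on nodes, mirroring the jq command above.
# nodes_json is a hypothetical `kubectl get nodes -o json` excerpt.
nodes_json = json.loads("""
{"items": [
  {"metadata": {"name": "gpu-node", "labels": {"skypilot.co/accelerator": "v100"}}},
  {"metadata": {"name": "cpu-node", "labels": {}}}
]}
""")

LABEL = "skypilot.co/accelerator"
for node in nodes_json["items"]:
    gpu_type = node["metadata"]["labels"].get(LABEL)
    if gpu_type is None:
        print(f"{node['metadata']['name']}: missing {LABEL} label")
    elif gpu_type != gpu_type.lower():
        # Labels are case-sensitive; the GPU name must be lowercase (e.g., v100).
        print(f"{node['metadata']['name']}: label must be lowercase, got {gpu_type}")
    else:
        print(f"{node['metadata']['name']}: {gpu_type}")
```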

Step B3 - Can SkyPilot see your GPUs?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Run :code:`sky check` to verify that SkyPilot can see your GPUs.

.. code-block:: bash

    $ sky check
    # Should show `Kubernetes: Enabled` and should not print any warnings about GPU support.

    # List the available GPUs in your cluster
    $ sky show-gpus --cloud kubernetes

Step B4 - Try launching a dummy GPU task
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Next, try running a simple GPU task to verify that SkyPilot can launch GPU tasks on your cluster.

.. code-block:: bash

    # Replace <gpu-type> with a GPU type from the sky show-gpus output
    $ sky launch -y -c mygpucluster --cloud kubernetes --gpus <gpu-type>:1 -- "nvidia-smi"
    # Task should run and print the nvidia-smi output to the console
    # Once you have verified that the task runs, you can delete the cluster
    $ sky down -y mygpucluster

If your task does not run, check the terminal and the provisioning logs for errors. The path to the provisioning logs is printed at the start of the SkyPilot output, in the line starting with "To view detailed progress: ...".

.. _kubernetes-troubleshooting-ports:

Verifying ports support
-----------------------

If you are trying to run a task that requires ports to be opened, make sure you have followed the instructions in :ref:`kubernetes-ports`
to configure SkyPilot and your cluster to use the desired method (LoadBalancer service or Nginx Ingress) for ports support.

In this section, we will verify that your cluster supports ports and that services launched by SkyPilot can be accessed.

.. _kubernetes-troubleshooting-ports-loadbalancer:

Step C0 - Verifying LoadBalancer service setup
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you are using LoadBalancer services for ports support, follow the below steps to verify that your cluster is configured correctly.

.. tip::

If you are using Nginx Ingress for ports support, skip to :ref:`kubernetes-troubleshooting-ports-nginx`.

Does your cluster support LoadBalancer services?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To verify that your cluster supports LoadBalancer services, we will create an example service and verify that it gets an external IP.

.. code-block:: bash

    $ kubectl apply -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/cpu_test_pod.yaml
    $ kubectl apply -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/loadbalancer_test_svc.yaml
    # Verify that the service gets an external IP
    # Note: On cloud providers, it may take some time for the status to change from pending to an external IP
    $ watch kubectl get svc skytest-loadbalancer
    # Once you get an IP, try accessing the HTTP server by curling the external IP
    $ IP=$(kubectl get svc skytest-loadbalancer -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    $ curl $IP:8080
    # Once you have verified that the service is accessible, you can delete it
    $ kubectl delete -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/cpu_test_pod.yaml
    $ kubectl delete -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/loadbalancer_test_svc.yaml

If your service does not get an external IP, check the service's status with :code:`kubectl describe svc skytest-loadbalancer`. Your cluster may not support LoadBalancer services.


.. _kubernetes-troubleshooting-ports-nginx:

Step C0 - Verifying Nginx Ingress setup
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you are using Nginx Ingress for ports support, refer to :ref:`kubernetes-ingress` for instructions on how to install and configure Nginx Ingress.

.. tip::

If you are using LoadBalancer services for ports support, you can skip this section.

Does your cluster support Nginx Ingress?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To verify that your cluster supports Nginx Ingress, we will create an example ingress.

.. code-block:: bash

    $ kubectl apply -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/cpu_test_pod.yaml
    $ kubectl apply -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/ingress_test.yaml
    # Get the external IP of the ingress using the externalIPs field or the loadBalancer field
    $ IP=$(kubectl get service ingress-nginx-controller -n ingress-nginx -o jsonpath='{.spec.externalIPs[*]}') && [ -z "$IP" ] && IP=$(kubectl get service ingress-nginx-controller -n ingress-nginx -o jsonpath='{.status.loadBalancer.ingress[*].ip}')
    $ echo "Got IP: $IP"
    $ curl http://$IP/skytest
    # Once you have verified that the ingress is accessible, you can delete it
    $ kubectl delete -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/cpu_test_pod.yaml
    $ kubectl delete -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/ingress_test.yaml

If no IP is assigned, check the ingress controller service's status with :code:`kubectl describe svc ingress-nginx-controller -n ingress-nginx`.
Your ingress controller's service must be of type :code:`LoadBalancer` or :code:`NodePort` and must have an external IP.
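The externalIPs-then-loadBalancer fallback used in the shell one-liner above can be sketched in Python for clarity. The service dict below is a hypothetical, trimmed :code:`kubectl get service ... -o json` result, not real output.

```python
from typing import Optional

# Sketch: prefer spec.externalIPs, then fall back to status.loadBalancer.ingress,
# mirroring the shell one-liner above. svc is a hypothetical service dict.

def ingress_external_ip(svc: dict) -> Optional[str]:
    """Return the first external IP of an ingress controller service, if any."""
    external_ips = svc.get("spec", {}).get("externalIPs", [])
    if external_ips:
        return external_ips[0]
    for ing in svc.get("status", {}).get("loadBalancer", {}).get("ingress", []):
        if "ip" in ing:
            return ing["ip"]
    return None  # No external IP: service may not be LoadBalancer/NodePort-backed

# Example with a placeholder LoadBalancer-assigned IP:
svc = {"spec": {}, "status": {"loadBalancer": {"ingress": [{"ip": "203.0.113.7"}]}}}
print(ingress_external_ip(svc))
# -> 203.0.113.7
```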

Is SkyPilot configured to use Nginx Ingress?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Take a look at your :code:`~/.sky/config.yaml` file to verify that the :code:`ports: ingress` section is configured correctly.

.. code-block:: bash

    $ cat ~/.sky/config.yaml
    # Output should contain:
    #
    # kubernetes:
    #   ports: ingress

If not, add the :code:`ports: ingress` section to your :code:`~/.sky/config.yaml` file.

.. _kubernetes-troubleshooting-ports-dryrun:

Step C1 - Verifying SkyPilot can launch services
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Next, try running a simple task with a service to verify that SkyPilot can launch services on your cluster.

.. code-block:: bash

    $ sky launch -y -c myserver --cloud kubernetes --ports 8080 -- "python -m http.server 8080"
    # Obtain the endpoint of the service
    $ sky status --endpoint 8080 myserver
    # Try curling the endpoint to verify that the service is accessible
    $ curl <endpoint>

If you are unable to get the endpoint from SkyPilot,
run :code:`kubectl describe services` or :code:`kubectl describe ingress` to debug further.
2 changes: 1 addition & 1 deletion sky/provision/kubernetes/utils.py
@@ -32,7 +32,7 @@
}
NO_GPU_ERROR_MESSAGE = 'No GPUs found in Kubernetes cluster. \
If your cluster contains GPUs, make sure nvidia.com/gpu resource is available on the nodes and the node labels for identifying GPUs \
(e.g., skypilot.co/accelerators) are setup correctly. \
(e.g., skypilot.co/accelerator) are setup correctly. \
To further debug, run: sky check.'

# TODO(romilb): Add links to docs for configuration instructions when ready.
