Add example of MiniCPM #3854

Open · wants to merge 2 commits into master · Changes from 1 commit
2 changes: 2 additions & 0 deletions README.md
@@ -26,6 +26,7 @@

----
:fire: *News* :fire:
- [Aug, 2024] Serve [**MiniCPM**](https://github.com/OpenBMB/MiniCPM) on your infra: [**example**](./llm/minicpm/)
- [Jul, 2024] [Finetune](./llm/llama-3_1-finetuning/) and [serve](./llm/llama-3_1/) **Llama 3.1** on your infra
- [Jun, 2024] Reproduce **GPT** with [llm.c](https://github.com/karpathy/llm.c/discussions/481) on any cloud: [**guide**](./llm/gpt-2/)
- [Apr, 2024] Serve and finetune [**Llama 3**](https://skypilot.readthedocs.io/en/latest/gallery/llms/llama-3.html) on any cloud or Kubernetes: [**example**](./llm/llama-3/)
@@ -156,6 +157,7 @@ To learn more, see our [Documentation](https://skypilot.readthedocs.io/en/latest
<!-- Keep this section in sync with index.rst in SkyPilot Docs -->
Runnable examples:
- LLMs on SkyPilot
- [MiniCPM](./llm/minicpm/)
- [Llama 3.1 finetuning](./llm/llama-3_1-finetuning/) and [serving](./llm/llama-3_1/)
- [GPT-2 via `llm.c`](./llm/gpt-2/)
- [Llama 3](./llm/llama-3/)
1 change: 1 addition & 0 deletions docs/source/_gallery_original/index.rst
@@ -35,6 +35,7 @@ Contents
:caption: LLM Models

Mixtral (Mistral AI) <llms/mixtral>
MiniCPM (OpenBMB) <llms/minicpm>
Collaborator:
By adding this here, we need to soft link the readme file at llms/minicpm/README.md to docs/source/_gallery_original/llms/.

Author:
Already modified
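For readers following along: the docs gallery pages are typically thin symlink wrappers around the example READMEs, so the change the reviewer is asking about might look roughly like the sketch below. The exact relative path and target file name are assumptions based on the gallery layout, not part of this PR.

```bash
# Hypothetical sketch: expose llm/minicpm/README.md to the docs gallery via a symlink.
# The relative depth (../../../../) assumes the usual docs/source/_gallery_original/llms/ layout.
mkdir -p docs/source/_gallery_original/llms
ln -s ../../../../llm/minicpm/README.md docs/source/_gallery_original/llms/minicpm.md
```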

Mistral 7B (Mistral AI) <https://docs.mistral.ai/self-deployment/skypilot/>
DBRX (Databricks) <llms/dbrx>
Llama-2 (Meta) <llms/llama-2>
1 change: 1 addition & 0 deletions docs/source/docs/index.rst
@@ -87,6 +87,7 @@ Runnable examples:
* `Databricks DBRX <https://github.com/skypilot-org/skypilot/tree/master/llm/dbrx>`_
* `Gemma <https://github.com/skypilot-org/skypilot/tree/master/llm/gemma>`_
* `Mixtral 8x7B <https://github.com/skypilot-org/skypilot/tree/master/llm/mixtral>`_; `Mistral 7B <https://docs.mistral.ai/self-deployment/skypilot>`_ (from official Mistral team)
* `MiniCPM <https://github.com/skypilot-org/skypilot/tree/master/llm/minicpm>`_ (from official OpenBMB team)
* `Code Llama <https://github.com/skypilot-org/skypilot/tree/master/llm/codellama/>`_
* `vLLM: Serving LLM 24x Faster On the Cloud <https://github.com/skypilot-org/skypilot/tree/master/llm/vllm>`_ (from official vLLM team)
* `SGLang: Fast and Expressive LLM Serving On the Cloud <https://github.com/skypilot-org/skypilot/tree/master/llm/sglang/>`_ (from official SGLang team)
79 changes: 79 additions & 0 deletions llm/minicpm/README.md
@@ -0,0 +1,79 @@


📰 **Update (26 April 2024) -** SkyPilot now also supports the [**MiniCPM-2B**](https://openbmb.vercel.app/?category=Chinese+Blog/) model! Use [serve-2b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/minicpm/serve-2b.yaml) to serve the 2B model.

📰 **Update (6 Jun 2024) -** SkyPilot now also supports the [**MiniCPM-1B**](https://openbmb.vercel.app/?category=Chinese+Blog/) model!

<p align="center">
<img src="https://i.imgur.com/d7tEhAl.gif" alt="minicpm" width="600"/>
</p>

## References
* [MiniCPM blog](https://openbmb.vercel.app/?category=Chinese+Blog/)

## Why use SkyPilot to deploy over commercial hosted solutions?

* Get the best GPU availability by utilizing multiple resource pools across multiple regions and clouds.
* Pay absolute minimum — SkyPilot picks the cheapest resources across regions and clouds. No managed solution markups.
* Scale up to multiple replicas across different locations and accelerators, all served with a single endpoint (see the SkyServe sketch further below).
* Everything stays in your cloud account (your VMs & buckets).
* Completely private - no one else sees your chat history.


## Running your own MiniCPM with SkyPilot

After [installing SkyPilot](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html), run your own MiniCPM model on vLLM with a single command:

1. Start serving MiniCPM on a single instance with any available GPU in the list specified in [serve-2b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/minicpm/serve-2b.yaml), with a vLLM-powered OpenAI-compatible endpoint. (You can also switch to [serve-1b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/minicpm/serve-1b.yaml) for the 1B model, or [serve-cpmv2_6.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/minicpm/serve-cpmv2_6.yaml) for the multimodal MiniCPM-V 2.6 model.)

```bash
sky launch -c cpm serve-2b.yaml
```
2. Send a request to the endpoint for completion:
```bash
IP=$(sky status --ip cpm)

curl http://$IP:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "openbmb/MiniCPM-2B-sft-bf16",
"prompt": "My favorite food is",
"max_tokens": 512
}' | jq -r '.choices[0].text'
```

3. Send a request for chat completion:
```bash
curl http://$IP:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "openbmb/MiniCPM-1B-sft-bf16",
"messages": [
{
"role": "system",
"content": "You are a helpful and honest chat expert."
},
{
"role": "user",
"content": "What is the best food?"
}
],
"max_tokens": 512
}' | jq -r '.choices[0].message.content'
```
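The serve YAMLs in this example also declare a `service:` section (a readiness probe plus two replicas), so the same file can alternatively be deployed behind a single load-balanced endpoint with SkyServe rather than a single cluster. A minimal sketch, assuming the service is named `cpm`:

```bash
# Deploy the same YAML as a SkyServe service; the `service:` section defines the replicas.
sky serve up -n cpm ./serve-2b.yaml

# Once replicas pass the readiness probe, query the single load-balanced endpoint.
ENDPOINT=$(sky serve status --endpoint cpm)
curl http://$ENDPOINT/v1/models
```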


## **Optional:** Accessing MiniCPM with Chat GUI
Collaborator:

Should we update the name of the model here?

Author:

Already modified



It is also possible to access the MiniCPM service with a GUI using [vLLM](https://github.com/vllm-project/vllm).

1. Start the chat web UI (change the `--env` flag to the model you are running):
```bash
sky launch -c cpm-gui ./gui.yaml --env MODEL_NAME='openbmb/MiniCPM-2B-sft-bf16' --env ENDPOINT=$(sky status --ip cpm):8000
```

2. Then, we can access the GUI at the returned gradio link:
```
| INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live
```
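When you are finished, the example clusters can be torn down to stop incurring cost. A minimal sketch, assuming the cluster names used above:

```bash
# Tear down the serving cluster and the GUI cluster created in this example.
sky down cpm cpm-gui
```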

44 changes: 44 additions & 0 deletions llm/minicpm/gui.yaml
@@ -0,0 +1,44 @@
# Starts a GUI server that connects to the MiniCPM OpenAI API server.
#
# Refer to llm/minicpm/README.md for more details.
#
# Usage:
#
# 1. If you have an endpoint started on a cluster (sky launch):
#    `sky launch -c cpm-gui ./gui.yaml --env ENDPOINT=$(sky status --ip cpm):8000`
# 2. If you have a SkyPilot Service started (sky serve up) called cpm:
#    `sky launch -c cpm-gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint cpm)`
#
# After the GUI server is started, you will see a gradio link in the output and
# you can click on it to open the GUI.

envs:
  ENDPOINT: x.x.x.x:3031 # Address of the API server running MiniCPM.
MODEL_NAME: openbmb/MiniCPM-2B-sft-bf16

resources:
cpus: 2

setup: |
conda activate cpm
if [ $? -ne 0 ]; then
conda create -n cpm python=3.10 -y
conda activate cpm
fi

# Install Gradio for web UI.
pip install gradio openai

run: |
conda activate cpm
export PATH=$PATH:/sbin
WORKER_IP=$(hostname -I | cut -d' ' -f1)
CONTROLLER_PORT=21001
WORKER_PORT=21002

echo 'Starting gradio server...'
git clone https://github.com/vllm-project/vllm.git || true
python vllm/examples/gradio_openai_chatbot_webserver.py \
-m $MODEL_NAME \
--port 8811 \
--model-url http://$ENDPOINT/v1 | tee ~/gradio.log
41 changes: 41 additions & 0 deletions llm/minicpm/serve-1b.yaml
@@ -0,0 +1,41 @@
envs:
MODEL_NAME: openbmb/MiniCPM-1B-sft-bf16

service:
# Specifying the path to the endpoint to check the readiness of the replicas.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_tokens: 1
initial_delay_seconds: 1200
# How many replicas to manage.
replicas: 2


resources:
accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
disk_tier: best
ports: 8000

setup: |
conda activate cpm
if [ $? -ne 0 ]; then
conda create -n cpm python=3.10 -y
conda activate cpm
fi
pip install vllm==0.5.4
pip install flash-attn==2.5.9.post1

run: |
conda activate cpm
export PATH=$PATH:/sbin
python -u -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--model $MODEL_NAME \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--max-num-seqs 16 | tee ~/openai_api_server.log
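For reference, the `readiness_probe` above amounts to SkyServe periodically POSTing the `post_data` payload to the configured path on each replica. An equivalent manual check might look like the following, assuming a replica reachable at `$IP:8000`:

```bash
# Roughly what the readiness probe sends: a one-token chat completion against a replica.
curl http://$IP:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openbmb/MiniCPM-1B-sft-bf16",
    "messages": [{"role": "user", "content": "Hello! What is your name?"}],
    "max_tokens": 1
  }'
```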

40 changes: 40 additions & 0 deletions llm/minicpm/serve-2b.yaml
@@ -0,0 +1,40 @@
envs:
MODEL_NAME: openbmb/MiniCPM-2B-sft-bf16

service:
# Specifying the path to the endpoint to check the readiness of the replicas.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_tokens: 1
initial_delay_seconds: 1200
# How many replicas to manage.
replicas: 2


resources:
accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
disk_tier: best
ports: 8000

setup: |
conda activate cpm
if [ $? -ne 0 ]; then
conda create -n cpm python=3.10 -y
conda activate cpm
fi
pip install vllm==0.5.4
pip install flash-attn==2.5.9.post1

run: |
conda activate cpm
export PATH=$PATH:/sbin
python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--model $MODEL_NAME \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--max-model-len 1024 | tee ~/openai_api_server.log
40 changes: 40 additions & 0 deletions llm/minicpm/serve-cpmv2_6.yaml
@@ -0,0 +1,40 @@
envs:
MODEL_NAME: openbmb/MiniCPM-V-2_6

service:
# Specifying the path to the endpoint to check the readiness of the replicas.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_tokens: 1
initial_delay_seconds: 1200
# How many replicas to manage.
replicas: 2


resources:
accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
disk_tier: best
ports: 8000

setup: |
conda activate cpm
if [ $? -ne 0 ]; then
conda create -n cpm python=3.10 -y
conda activate cpm
fi
pip install vllm==0.5.4
pip install flash-attn==2.5.9.post1

run: |
conda activate cpm
export PATH=$PATH:/sbin
python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--model $MODEL_NAME \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--max-model-len 1024 | tee ~/openai_api_server.log