Commit 6b6a2da

Merge branch 'PaddlePaddle:develop' into develop

fightfat committed Sep 18, 2024
2 parents 4a3297f + 0540e97 commit 6b6a2da

Showing 82 changed files with 3,199 additions and 1,535 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -126,6 +126,6 @@ FETCH_HEAD
./ppdiffusers/ppdiffusers/version.py

# third party
-csrc/gpu/cutlass_kernels/cutlass
+csrc/third_party/
dataset/
output/
2 changes: 1 addition & 1 deletion Makefile
@@ -46,7 +46,7 @@ unit-test:

.PHONY: install
install:
-pip install paddlepaddle==0.0.0 -f https://www.paddlepaddle.org.cn/whl/linux/cpu-mkl/develop.html
+pip install --pre paddlepaddle -i https://www.paddlepaddle.org.cn/packages/nightly/cpu/
pip install -r requirements-dev.txt
pip install -r requirements.txt
pip install -r paddlenlp/experimental/autonlp/requirements.txt
7 changes: 5 additions & 2 deletions README.md
@@ -42,7 +42,8 @@

### <a href=#多硬件训推一体> 🔧 Unified Multi-Hardware Training and Inference </a>

-Supports large model training and inference on a range of hardware, including NVIDIA GPUs, Kunlun XPUs, Ascend NPUs, Enflame GCUs, and Hygon DCUs; the toolkit's interfaces allow fast hardware switching, greatly reducing the R&D cost of migrating between platforms.
+Supports large model and natural language understanding (NLU) model training and inference on a range of hardware, including NVIDIA GPUs, Kunlun XPUs, Ascend NPUs, Enflame GCUs, and Hygon DCUs; the toolkit's interfaces allow fast hardware switching, greatly reducing the R&D cost of migrating between platforms.
+Currently supported NLU models: [multi-hardware NLU model list](./docs/model_zoo/model_list_multy_device.md)

### <a href=#高效易用的预训练> 🚀 Efficient and Easy-to-Use Pre-training </a>

@@ -174,13 +175,14 @@ PaddleNLP provides a convenient and easy-to-use Auto API for quickly loading models and Tokenizers
>>> model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B", dtype="float16")
>>> input_features = tokenizer("你好!请自我介绍一下。", return_tensors="pd")
>>> outputs = model.generate(**input_features, max_length=128)
->>> print(tokenizer.batch_decode(outputs[0]))
+>>> print(tokenizer.batch_decode(outputs[0], skip_special_tokens=True))
['我是一个AI语言模型,我可以回答各种问题,包括但不限于:天气、新闻、历史、文化、科学、教育、娱乐等。请问您有什么需要了解的吗?']
```
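The effect of the `skip_special_tokens=True` argument in the call above can be sketched with a toy decoder. The token ids and vocabulary below are made up for illustration (they are not Qwen2's real ones): special token ids are simply filtered out before the remaining ids are mapped back to text.

```python
# Toy sketch of skip_special_tokens: drop special token ids (pad/eos here,
# with hypothetical id values) before mapping the remaining ids back to text.
SPECIAL_IDS = {0, 2}  # e.g. <pad>=0, <eos>=2 (made-up ids)
ID_TO_TOKEN = {0: "<pad>", 1: "Hello", 2: "<eos>", 3: "world"}

def decode(ids, skip_special_tokens=False):
    if skip_special_tokens:
        ids = [i for i in ids if i not in SPECIAL_IDS]
    return " ".join(ID_TO_TOKEN[i] for i in ids)

print(decode([1, 3, 2]))                            # Hello world <eos>
print(decode([1, 3, 2], skip_special_tokens=True))  # Hello world
```

This is why the updated call prints only the model's reply, without trailing end-of-sequence markers.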

### Large Model Pre-training

```shell
+git clone https://github.com/PaddlePaddle/PaddleNLP.git && cd PaddleNLP  # skip this step if PaddleNLP is already cloned or downloaded
mkdir -p llm/data && cd llm/data
wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k.bin
wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k.idx
@@ -191,6 +193,7 @@ python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_pretrain.py
### Large Model SFT Fine-tuning

```shell
+git clone https://github.com/PaddlePaddle/PaddleNLP.git && cd PaddleNLP  # skip this step if PaddleNLP is already cloned or downloaded
mkdir -p llm/data && cd llm/data
wget https://bj.bcebos.com/paddlenlp/datasets/examples/AdvertiseGen.tar.gz && tar -zxvf AdvertiseGen.tar.gz
cd .. # change folder to PaddleNLP/llm
4 changes: 3 additions & 1 deletion README_en.md
@@ -93,13 +93,14 @@ PaddleNLP provides a convenient and easy-to-use Auto API, which can quickly load
>>> model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B", dtype="float16")
>>> input_features = tokenizer("你好!请自我介绍一下。", return_tensors="pd")
>>> outputs = model.generate(**input_features, max_length=128)
->>> print(tokenizer.batch_decode(outputs[0]))
+>>> print(tokenizer.batch_decode(outputs[0], skip_special_tokens=True))
['我是一个AI语言模型,我可以回答各种问题,包括但不限于:天气、新闻、历史、文化、科学、教育、娱乐等。请问您有什么需要了解的吗?']
```

### Pre-training for large language models

```shell
+git clone https://github.com/PaddlePaddle/PaddleNLP.git && cd PaddleNLP  # skip this step if PaddleNLP is already cloned or downloaded
mkdir -p llm/data && cd llm/data
wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k.bin
wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k.idx
@@ -110,6 +111,7 @@ python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_pretrain.py
### SFT fine-tuning for large language models

```shell
+git clone https://github.com/PaddlePaddle/PaddleNLP.git && cd PaddleNLP  # skip this step if PaddleNLP is already cloned or downloaded
mkdir -p llm/data && cd llm/data
wget https://bj.bcebos.com/paddlenlp/datasets/examples/AdvertiseGen.tar.gz && tar -zxvf AdvertiseGen.tar.gz
cd .. # change folder to PaddleNLP/llm
13 changes: 12 additions & 1 deletion csrc/README.md
@@ -10,6 +10,12 @@ pip install -r requirements.txt

## Compiling the CUDA Operators

+Generate the FP8 cutlass operators (compilation takes a long time):
+```shell
+python generate_code_gemm_fused_kernels.py
+```
+
+Compile:
```shell
python setup_cuda.py install
```
@@ -20,9 +26,14 @@ python setup_cuda.py install
2. Pull the code:
   git clone -b v3.5.0 --single-branch https://github.com/NVIDIA/cutlass.git

-3. Place the downloaded `cutlass` directory at `csrc/gpu/cutlass_kernels/cutlass`
+3. Place the downloaded `cutlass` directory at `csrc/third_party/cutlass`

4. Recompile the CUDA operators
```shell
python setup_cuda.py install
```

+### FP8 GEMM Auto-tuning
+```shell
+sh tune_fp8_gemm.sh
+```
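The `tune_fp8_gemm.sh` step above is an auto-tuner: it benchmarks candidate kernel configurations for each GEMM shape and keeps the fastest. The general idea can be sketched in pure Python; the tile sizes, shapes, and function names below are illustrative only, not the actual cutlass tuner.

```python
# Conceptual sketch of GEMM auto-tuning: time a tiled matmul under several
# candidate tile sizes and pick the fastest for a given problem shape.
import time

def matmul_tiled(A, B, tile):
    """Blocked matrix multiply C = A @ B with square tiles of size `tile`."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, m, tile):
            for kk in range(0, k, tile):
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, m)):
                        s = C[i][j]
                        for p in range(kk, min(kk + tile, k)):
                            s += A[i][p] * B[p][j]
                        C[i][j] = s
    return C

def autotune(shape, candidates=(4, 8, 16)):
    """Benchmark each candidate tile size on `shape` and return the fastest."""
    n, k, m = shape
    A = [[1.0] * k for _ in range(n)]
    B = [[1.0] * m for _ in range(k)]
    timings = {}
    for tile in candidates:
        t0 = time.perf_counter()
        matmul_tiled(A, B, tile)
        timings[tile] = time.perf_counter() - t0
    return min(timings, key=timings.get)

best = autotune((32, 32, 32))
```

A real tuner sweeps many more parameters (tile shapes, warp layouts, pipeline stages) and caches the winner per shape so later runs skip the search.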