[GPT] Fix some typos. (PaddlePaddle#390)
* [GPT] doc fix.

* [GPT] Do not import `paddlenlp.ops` directly.
ZHUI committed May 16, 2021
1 parent 3bf7b2f commit fb7c288
Showing 9 changed files with 45 additions and 47 deletions.
2 changes: 1 addition & 1 deletion docs/model_zoo.md
@@ -30,7 +30,7 @@ PaddleNLP provides a rich set of model architectures, including classic RNN-style models,
| [ERNIE-Tiny](../examples/text_classification/pretrained_models) | A compact ERNIE architecture developed by Baidu: a shallow Transformer with widened hidden layers and a Chinese subword vocabulary, combined with distillation, improving over the pre-BERT SOTA by 8.35% while running 4.3x faster. |
| [ERNIE-GEN](../examples/text_generation/ernie-gen) | [ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation](https://arxiv.org/abs/2001.11314) ERNIE-GEN is a generative pre-training model released by Baidu. It tackles the exposure bias between training and inference through a Global-Attention scheme, uses a Multi-Flow Attention mechanism so that Global and Context information interact separately, and generates text span by span to improve semantic coherence. |
| [ERNIESage](../examples/text_graph/erniesage)| ERNIESage (ERNIE SAmple aggreGatE) models the connections between a node and its neighbors as a graph, turns those node-neighbor relations into linked samples fed into ERNIE, and uses ERNIE as the aggregator to capture the semantic relation between a node and its neighbors, ultimately strengthening the semantic representation of nodes in the graph.|
| [GPT-2](../examples/language_model/gpt) |[Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf) |
| [GPT](../examples/language_model/gpt) |[Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf) |
| [ELECTRA](../examples/language_model/electra/) | [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://arxiv.org/abs/2003.10555) |
| [XLNet](../examples/language_model/xlnet/) | [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) |
| [RoBERTa](../examples/text_classification/pretrained_models) | [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) |
2 changes: 1 addition & 1 deletion docs/model_zoo/transformers.md
@@ -13,7 +13,7 @@
|[ERNIE](https://arxiv.org/abs/1904.09223)|ErnieTokenizer<br>ErnieTinyTokenizer|ErnieModel<br> ErnieForQuestionAnswering<br> ErnieForSequenceClassification<br> ErnieForTokenClassification | `ernie-1.0`<br> `ernie-tiny`<br> `ernie-2.0-en`<br> `ernie-2.0-large-en`|
|[ERNIE-GEN](https://arxiv.org/abs/2001.11314)|ErnieTokenizer| ErnieForGeneration|`ernie-gen-base-en`<br>`ernie-gen-large-en`<br>`ernie-gen-large-en-430g`|
| ERNIE-CTM | ErnieCtmTokenizer | ErnieCtmModel<br> ErnieCtmWordtagModel | `ernie-ctm` |
|[GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)| GPTTokenizer<br> GPTChineseTokenizer| GPTForGreedyGeneration| `gpt-cpm-large-cn` <br> `gpt2-medium-en`|
|[GPT](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)| GPTTokenizer<br> GPTChineseTokenizer| GPTForGreedyGeneration| `gpt-cpm-large-cn` <br> `gpt2-medium-en`|
|[RoBERTa](https://arxiv.org/abs/1907.11692)|RobertaTokenizer| RobertaModel<br>RobertaForQuestionAnswering<br>RobertaForSequenceClassification<br>RobertaForTokenClassification| `roberta-wwm-ext`<br> `roberta-wwm-ext-large`<br> `rbt3`<br> `rbtl3`|
| [BigBird](https://arxiv.org/abs/2007.14062) | BigBirdTokenizer | BigBirdModel<br> BigBirdForSequenceClassification<br> BigBirdForPretraining | `bigbird-base-uncased` |
|[ELECTRA](https://arxiv.org/abs/2003.10555) | ElectraTokenizer| ElectraModel<br>ElectraForSequenceClassification<br>ElectraForTokenClassification<br>|`electra-small`<br> `electra-base`<br> `electra-large`<br> `chinese-electra-small`<br> `chinese-electra-base`<br>|
2 changes: 1 addition & 1 deletion docs/transformers.md
@@ -12,7 +12,7 @@
| [BERT](https://arxiv.org/abs/1810.04805) | BertTokenizer|BertModel<br> BertForQuestionAnswering<br> BertForSequenceClassification<br>BertForTokenClassification| `bert-base-uncased`<br> `bert-large-uncased` <br>`bert-base-multilingual-uncased` <br>`bert-base-cased`<br> `bert-base-chinese`<br> `bert-base-multilingual-cased`<br> `bert-large-cased`<br> `bert-wwm-chinese`<br> `bert-wwm-ext-chinese` |
|[ERNIE](https://arxiv.org/abs/1904.09223)|ErnieTokenizer<br>ErnieTinyTokenizer|ErnieModel<br> ErnieForQuestionAnswering<br> ErnieForSequenceClassification<br> ErnieForTokenClassification | `ernie-1.0`<br> `ernie-tiny`<br> `ernie-2.0-en`<br> `ernie-2.0-large-en`|
|[ERNIE-GEN](https://arxiv.org/abs/2001.11314)|ErnieTokenizer| ErnieForGeneration|`ernie-gen-base-en`<br>`ernie-gen-large-en`<br>`ernie-gen-large-en-430g`|
|[GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)| GPTTokenizer<br> GPTChineseTokenizer| GPTForGreedyGeneration| `gpt-cpm-large-cn` <br> `gpt2-medium-en`|
|[GPT](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)| GPTTokenizer<br> GPTChineseTokenizer| GPTForGreedyGeneration| `gpt-cpm-large-cn` <br> `gpt2-medium-en`|
|[RoBERTa](https://arxiv.org/abs/1907.11692)|RobertaTokenizer| RobertaModel<br>RobertaForQuestionAnswering<br>RobertaForSequenceClassification<br>RobertaForTokenClassification| `roberta-wwm-ext`<br> `roberta-wwm-ext-large`<br> `rbt3`<br> `rbtl3`|
|[ELECTRA](https://arxiv.org/abs/2003.10555) | ElectraTokenizer| ElectraModel<br>ElectraForSequenceClassification<br>ElectraForTokenClassification<br>|`electra-small`<br> `electra-base`<br> `electra-large`<br> `chinese-electra-small`<br> `chinese-electra-base`<br>|
|[XLNet](https://arxiv.org/abs/1906.08237)| XLNetTokenizer| XLNetModel<br> XLNetForSequenceClassification<br> XLNetForTokenClassification |`xlnet-base-cased`<br> `xlnet-large-cased`<br> `chinese-xlnet-base`<br> `chinese-xlnet-mid`<br> `chinese-xlnet-large`|
33 changes: 18 additions & 15 deletions examples/language_model/gpt/README.md
@@ -1,9 +1,9 @@
# GPT-2
# GPT

## Model Introduction
[GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) (Language Models are Unsupervised Multitask Learners) is a language generation model that uses the [Transformer](https://arxiv.org/abs/1706.03762) decoder as its basic building block and is pre-trained autoregressively on large-scale unlabeled text corpora.
GPT-[2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)/[3](https://arxiv.org/pdf/2005.14165.pdf) are language generation models that use the [Transformer](https://arxiv.org/abs/1706.03762) decoder as their basic building block and are pre-trained autoregressively on large-scale unlabeled text corpora.
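
For reference, the autoregressive pre-training objective mentioned above factorizes a token sequence left to right and minimizes its negative log-likelihood (the standard formulation, not anything specific to this repository):

```
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}), \qquad
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})
```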

This project is a PaddlePaddle implementation of the GPT-2 language model, covering model training, inference, and more. The brief directory structure of this example is shown below:
This project is a PaddlePaddle implementation of the GPT language model, covering model training, inference, and more. The brief directory structure of this example is shown below:

```text
.
@@ -13,8 +13,8 @@
├── decompress.sh # script for decompressing the dataset
├── deploy/ # inference scripts for model deployment
├── export_model.py # script for exporting the model for inference deployment
├── predict.py # text generation demo
├── lr.py # learning rate schedule
├── predict.py # text generation demo
├── README.md # documentation
├── run_eval.py # evaluation entry point
├── run_pretrain.py # pre-training entry point
@@ -42,7 +42,7 @@
xz -d openwebtext.tar.xz
tar xf openwebtext.tar
mkdir raw_data
bash decompress.sh
bash decompress.sh
```

After decompression, the resulting `raw_data` directory is about 54 GB in size.
@@ -87,7 +87,7 @@ CUDA_VISIBLE_DEVICES=0 python run_pretrain.py \
--save_steps 100000\
--decay_steps 320000\
--warmup_rate 0.01\
--batch_size 4\
--micro_batch_size 4\
--device gpu
```

@@ -99,7 +99,7 @@ CUDA_VISIBLE_DEVICES=0 python run_pretrain.py \
- `grad_clip` Gradient clipping range.
- `max_steps` Maximum number of training steps.
- `save_steps` Interval (in steps) at which the model is saved.
- `batch_size` Training batch size.
- `micro_batch_size` Training batch size.
- `device` Training device.

You can also launch training directly with the provided shell script: `sh scripts/run.sh`.
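
The `max_lr`, `min_lr`, `warmup_rate`, and `decay_steps` flags in the commands above shape the learning-rate schedule. The sketch below only illustrates a typical warmup-then-cosine-decay curve under assumed semantics for those flags; the schedule actually used by this example lives in `lr.py` and may differ.

```python
import math

def illustrative_lr(step, max_lr=0.00015, min_lr=0.00001,
                    warmup_steps=3200, decay_steps=320000):
    """Illustrative warmup + cosine-decay schedule (not the repository's lr.py).

    Hypothetical assumption: warmup_steps would be derived from warmup_rate
    (e.g. warmup_rate * decay_steps), and the LR is held at min_lr afterwards.
    """
    if step < warmup_steps:
        # Linear warmup from 0 up to max_lr.
        return max_lr * step / warmup_steps
    if step > decay_steps:
        # After the decay window, hold the learning rate at min_lr.
        return min_lr
    # Cosine decay from max_lr down to min_lr over the decay window.
    progress = (step - warmup_steps) / (decay_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

if __name__ == "__main__":
    for s in (0, 1600, 3200, 160000, 320000, 400000):
        print(s, round(illustrative_lr(s), 8))
```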
@@ -121,7 +121,7 @@ python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_pretrain.py \
--save_steps 100000\
--decay_steps 320000\
--warmup_rate 0.01\
--batch_size 4\
--micro_batch_size 4\
--device gpu
```

@@ -153,7 +153,7 @@ python run_eval.py --model_name gpt2-medium-en \
The parameters are explained as follows:
`model_name` Name of the model to use, e.g. gpt2-medium-en.
`eval_path` Path to the dataset.
`init_checkpoint_path` Path to the model parameters.
`init_checkpoint_path` Path to the model parameters.
`batch_size` Batch size.
`device` Device to run on: cpu, gpu, or xpu.
`overlapping_eval` Sliding-window parameter for the wikitext dataset (see the sketch below).
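
As a rough illustration of the sliding-window idea behind `--overlapping_eval`: the text is scored in windows of `seq_length` tokens whose start advances by `overlapping_eval` tokens, and after the first window only the newly exposed tokens contribute to the perplexity, so each token is evaluated once with as much left context as the window allows. The code below is a simplified stand-in for the logic in `run_eval.py`, which may differ in detail.

```python
def overlapping_windows(num_tokens, seq_length=1024, overlapping_eval=32):
    """Yield (start, end, first_scored) triples for sliding-window LM evaluation.

    Simplified illustration: token positions [first_scored, end) of each window
    are the ones whose log-probabilities would be added to the perplexity sum.
    """
    start, scored_up_to = 0, 0
    while scored_up_to < num_tokens:
        end = min(start + seq_length, num_tokens)
        # Only tokens that have not been scored in an earlier window count.
        first_scored = scored_up_to
        yield start, end, first_scored
        scored_up_to = end
        start += overlapping_eval

if __name__ == "__main__":
    # Example: 2,100 tokens scored with 1,024-token windows and a stride of 32.
    for window in overlapping_windows(2100):
        print(window)
```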
@@ -167,20 +167,22 @@ python run_eval.py --model_name gpt2-medium-en \
This project provides a simple text generation demo so you can try out the generation quality (a minimal Python sketch follows the sample output below).

```shell
python generate_sample.py
# Chinese example
python predict.py gpt-cn
# English example
python predict.py
```

Sample generated output:
```text
问题:中国的首都是哪里?答案:北京。
问题:苹果的CEO是谁? 答案:
乔布斯。
问题:苹果的CEO是谁? 答案:乔布斯。
默写古诗: 大漠孤烟直,长河落日圆。
举杯邀明月,
举杯邀明月,对影成三人。
对影成三人。
Question: Who is the CEO of Apple?
Answer: Tim Cook.
```
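
For reference, here is a minimal Python sketch of how greedy generation could be driven through the `GPTTokenizer`/`GPTForGreedyGeneration` classes listed in the transformer docs above. The `max_predict_len` argument, the forward call's return value, and the decoding helper are assumptions made for illustration; `predict.py` remains the authoritative example.

```python
import paddle
from paddlenlp.transformers import GPTForGreedyGeneration, GPTTokenizer

# English model; the Chinese demo would pair GPTChineseTokenizer with the
# `gpt-cpm-large-cn` weights instead.
MODEL_NAME = "gpt2-medium-en"

tokenizer = GPTTokenizer.from_pretrained(MODEL_NAME)
# Assumption: max_predict_len caps the number of greedily generated tokens.
model = GPTForGreedyGeneration.from_pretrained(MODEL_NAME, max_predict_len=32)
model.eval()

prompt = "Question: Who is the CEO of Apple? Answer:"
input_ids = tokenizer(prompt)["input_ids"]

with paddle.no_grad():
    # Assumption: the forward pass returns the full token-id sequence
    # (prompt plus greedily generated continuation).
    output_ids = model(paddle.to_tensor([input_ids], dtype="int64"))

print(tokenizer.convert_ids_to_string(output_ids[0].numpy().tolist()))
```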

## Model Export and Inference
@@ -223,6 +225,7 @@ sh scripts/run_static.sh

## References
- [Language Models are Unsupervised Multitask Learners](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
- [Language Models are Few-Shot Learners](https://arxiv.org/pdf/2005.14165.pdf)
- [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://arxiv.org/abs/2012.00413)
- [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
- [Efficient Large-Scale Language Model Training on GPU Clusters](https://arxiv.org/abs/2104.04473)
8 changes: 4 additions & 4 deletions examples/language_model/gpt/run_eval.py
@@ -34,10 +34,10 @@
parser.add_argument("--model_name", default=None, type=str, required=True, help="Path to pre-trained model or shortcut name selected in the list: "
+ ", ".join(sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], [])), )
parser.add_argument("--eval_path", default=None, type=str, required=True, help="The eval file path.", )
parser.add_argument('--cloze_eval', action='store_true', help='Evaluation dataset from `--eval_path` is a cloze task')
parser.add_argument('--overlapping_eval', type=int, default=32, help='Sliding window for overlapping eval ')
parser.add_argument("--init_checkpoint_path", default=None, type=str, help="The model checkpoint path.", )
parser.add_argument( "--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.", )
parser.add_argument('--cloze_eval', action='store_true', help='Evaluation dataset from `--eval_path` is a cloze task.')
parser.add_argument('--overlapping_eval', type=int, default=32, help='Sliding window for overlapping eval.')
parser.add_argument("--init_checkpoint_path", default=None, type=str, help="The model checkpoint path.")
parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.")
parser.add_argument('--seq_length', type=int, default=1024, help='Maximum sequence length to process for evaluation.')
parser.add_argument("--device", type=str, default="gpu", choices=["cpu", "gpu", "xpu"], help="Select cpu, gpu, xpu devices.")
parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.")
12 changes: 5 additions & 7 deletions examples/language_model/gpt/scripts/run.sh
@@ -1,7 +1,7 @@
set -x
export CUDA_VISIBLE_DEVICES=0
export FLAGS_fraction_of_gpu_memory_to_use=1.0

PYTHONPATH=../../../ python -u run_pretrain.py \
python -u run_pretrain.py \
--model_type "gpt"\
--model_name_or_path "gpt2-en"\
--input_dir "./data"\
@@ -10,14 +10,12 @@ PYTHONPATH=../../../ python -u run_pretrain.py \
--micro_batch_size 4\
--max_lr 0.00015\
--min_lr 0.00001\
--max_steps 70000\
--save_steps 70000\
--max_steps 500000\
--save_steps 100000\
--decay_steps 320000\
--weight_decay 0.01\
--warmup_rate 0.01\
--grad_clip 1.0\
--logging_freq 1\
--eval_freq 500\
--eval_freq 1000\
--device "gpu"


2 changes: 1 addition & 1 deletion examples/language_model/gpt/scripts/run_multi.sh
@@ -4,7 +4,7 @@ task_name="gpt-dygraph"
rm -rf output/$task_name/log

unset CUDA_VISIBLE_DEVICES
PYTHONPATH=../../../ python -m paddle.distributed.launch \
python -m paddle.distributed.launch \
--gpus "0,1,2,3" \
--log_dir "output/$task_name/log" run_pretrain.py \
--model_type "gpt" \
9 changes: 3 additions & 6 deletions examples/language_model/gpt/scripts/run_static.sh
@@ -1,10 +1,7 @@
set -x
export PADDLE_WITH_GLOO=0
export GLOG_v=0
export NCCL_DEBUG=INFO
export FLAGS_call_stack_level=2
export FLAGS_allocator_strategy=naive_best_fit
export FLAGS_fraction_of_gpu_memory_to_use=0.98
unset CUDA_VISIBLE_DEVICES

rm -rf *.prototxt
@@ -15,7 +12,7 @@ rm -rf main_sharding*
task_name="gpt-mp-sharding"
rm -rf output/$task_name/log

PYTHONPATH=../../../ python -u -m paddle.distributed.fleet.launch \
python -u -m paddle.distributed.fleet.launch \
--gpus "0,1,2,3,4,5,6,7" \
--log_dir "output/$task_name/log" run_pretrain_static.py \
--model_type "gpt" \
@@ -34,8 +31,8 @@ PYTHONPATH=../../../ python -u -m paddle.distributed.fleet.launch \
--use_recompute true \
--max_lr 0.00015 \
--min_lr 0.00001 \
--max_steps 70000 \
--save_steps 70000 \
--max_steps 500000 \
--save_steps 100000 \
--decay_steps 320000 \
--weight_decay 0.01\
--warmup_rate 0.01 \
22 changes: 11 additions & 11 deletions paddlenlp/transformers/gpt/modeling.py
@@ -27,7 +27,7 @@
from paddle.distributed.fleet import fleet

from .. import PretrainedModel, register_base_model
import paddlenlp.ops as ops
import paddlenlp

__all__ = [
'GPTModel',
@@ -80,22 +80,22 @@ def __init__(self,
self.out_proj = nn.Linear(
embed_dim, embed_dim, weight_attr, bias_attr=bias_attr)
else:
self.q_proj = ops.ColumnParallelLiner(
self.q_proj = paddlenlp.ops.ColumnParallelLiner(
(embed_dim, embed_dim),
topo.mp_info.size,
weight_attr,
bias_attr=bias_attr)
self.k_proj = ops.ColumnParallelLiner(
self.k_proj = paddlenlp.ops.ColumnParallelLiner(
(self.kdim, embed_dim),
topo.mp_info.size,
weight_attr,
bias_attr=bias_attr)
self.v_proj = ops.ColumnParallelLiner(
self.v_proj = paddlenlp.ops.ColumnParallelLiner(
(self.vdim, embed_dim),
topo.mp_info.size,
weight_attr,
bias_attr=bias_attr)
self.out_proj = ops.RowParallelLiner(
self.out_proj = paddlenlp.ops.RowParallelLiner(
(embed_dim, embed_dim),
topo.mp_info.size,
weight_attr,
@@ -357,12 +357,12 @@ def __init__(self,
weight_attrs[2],
bias_attr=bias_attrs[2])
else:
self.linear1 = ops.ColumnParallelLiner(
self.linear1 = paddlenlp.ops.ColumnParallelLiner(
(d_model, dim_feedforward),
topo.mp_info.size,
weight_attrs[2],
bias_attr=bias_attrs[2])
self.linear2 = ops.RowParallelLiner(
self.linear2 = paddlenlp.ops.RowParallelLiner(
(dim_feedforward, d_model),
topo.mp_info.size,
weight_attrs[2],
@@ -432,7 +432,7 @@ def __init__(self,
initializer=nn.initializer.Normal(
mean=0.0, std=initializer_range)))
#else:
# self.word_embeddings = ops.ParallelEmbedding(
# self.word_embeddings = paddlenlp.ops.ParallelEmbedding(
# vocab_size,
# hidden_size,
# topo.mp_info.size,
@@ -607,8 +607,8 @@ def __init__(self,
for i in range(num_hidden_layers):
DecoderLayer = TransformerDecoderLayer
if self.pipline_mode:
DecoderLayer = ops.guard(f'gpu:{i//self.layer_per_stage}')(
TransformerDecoderLayer)
DecoderLayer = paddlenlp.ops.guard(
f'gpu:{i//self.layer_per_stage}')(TransformerDecoderLayer)
decoder_layers.append(
DecoderLayer(
d_model=hidden_size,
@@ -625,7 +625,7 @@
topo=topo))

if self.pipline_mode:
Decoder = ops.guard(f'gpu:{self.topo.pp_info.size-1}')(
Decoder = paddlenlp.ops.guard(f'gpu:{self.topo.pp_info.size-1}')(
TransformerDecoder)
else:
Decoder = TransformerDecoder
