
[Trainer] PaddleNLP trainer and finetune ernie-1.0 pretrain. #1761

Merged
merged 19 commits into PaddlePaddle:develop on Mar 30, 2022

Conversation

@ZHUI (Collaborator) commented Mar 11, 2022

PR types

New features

PR changes

Others

Description

Finetune for all tasks.

Example usage:

  1. Finetune with PaddleNLP's built-in model weights:
python finetune.py  --dataset cmrc2018  --model_name_or_path ernie-1.0
  2. Finetune from a pretraining checkpoint:
python finetune.py  --dataset "clue ocnli"  --model_name_or_path ./output/ernie-dygraph_clue14g-dp8-gb512/model_500000/
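
For orientation, a minimal sketch of what a script such as run_seq_cls.py wires together with the new trainer. Only the TrainerBase import matches the diff quoted in the review threads below; the TrainingArguments import path and the constructor keywords are assumptions based on the HF-style flags printed in the logs, not this PR's verbatim code:

# Hedged sketch, not the PR's exact code. `TrainerBase` is confirmed by the
# diff below; the TrainingArguments location and constructor keywords are
# assumed from the flags shown in the training logs.
from paddlenlp.datasets import load_dataset
from paddlenlp.transformers import ErnieForSequenceClassification, ErnieTokenizer
from paddlenlp.trainer.trainer_base import TrainerBase
from paddlenlp.trainer.training_args import TrainingArguments  # assumed path

train_ds, dev_ds = load_dataset("chnsenticorp", splits=["train", "dev"])
model = ErnieForSequenceClassification.from_pretrained("ernie-1.0", num_classes=2)
tokenizer = ErnieTokenizer.from_pretrained("ernie-1.0")

training_args = TrainingArguments(  # field names taken from the log below
    output_dir="tmp",
    num_train_epochs=8,
    per_device_train_batch_size=128)

trainer = TrainerBase(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=dev_ds,
    tokenizer=tokenizer)
trainer.train()
trainer.evaluate()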

Trainer logs

(base) root@yq01-inf-hic-k8s-a100-aa24-0417.yq01.baidu.com finetune $ PYTHONPATH=../../../../ python run_seq_cls.py  --dataset chnsenticorp_v2  --model_name_or_path ernie-1.0 --fp16 true --fp16_opt_level O2 --output_dir tmp --gradient_accumulation_steps 1 --logging_steps 10 --eval_steps 50 --metric_for_best_model eval_accuracy  --load_best_model_at_end true
/nfs/zhonghui03/anaconda3/lib/python3.8/site-packages/setuptools/distutils_patch.py:25: UserWarning: Distutils was imported before Setuptools. This usage is discouraged and may exhibit undesirable behaviors or errors. Please use Setuptools' objects directly or at least import Setuptools first.
  warnings.warn(
[2022-03-25 16:40:31,350] [ WARNING] - Process rank: -1, device: gpu:0, world_size: 1, distributed training: False, 16-bits training: True
[2022-03-25 16:40:31,404] [    INFO] - Already cached /root/.paddlenlp/models/ernie-1.0/vocab.txt
[2022-03-25 16:40:31,415] [    INFO] - Already cached /root/.paddlenlp/models/ernie-1.0/ernie_v1_chn_base.pdparams
W0325 16:40:31.416553 105348 gpu_context.cc:240] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 11.4, Runtime API Version: 11.2
W0325 16:40:31.419184 105348 gpu_context.cc:268] device: 0, cuDNN Version: 8.0.
W0325 16:40:33.843657 105348 gpu_context.cc:449] WARNING: device: . The installed Paddle is compiled with CUDNN 8.1, but CUDNN version in your machine is 8.0, which may cause serious incompatible bug. Please recompile or reinstall Paddle with compatible CUDNN version.
[2022-03-25 16:40:38,432] [    INFO] - Using  half precision
[2022-03-25 16:40:38,432] [    INFO] - ============================================================
[2022-03-25 16:40:38,432] [    INFO] -         Configuration Arguments         
[2022-03-25 16:40:38,432] [    INFO] - paddle commit id              :9c2cee1c6c59ac440d4abd3d7ece8a5f8a140bd8
[2022-03-25 16:40:38,432] [    INFO] - adam_beta1                    :0.9
[2022-03-25 16:40:38,433] [    INFO] - adam_beta2                    :0.999
[2022-03-25 16:40:38,433] [    INFO] - adam_epsilon                  :1e-08
[2022-03-25 16:40:38,433] [    INFO] - dataloader_drop_last          :False
[2022-03-25 16:40:38,433] [    INFO] - dataloader_num_workers        :0
[2022-03-25 16:40:38,433] [    INFO] - debug                         :[]
[2022-03-25 16:40:38,433] [    INFO] - device                        :gpu:0
[2022-03-25 16:40:38,433] [    INFO] - disable_tqdm                  :False
[2022-03-25 16:40:38,433] [    INFO] - do_eval                       :True
[2022-03-25 16:40:38,433] [    INFO] - do_predict                    :False
[2022-03-25 16:40:38,433] [    INFO] - do_train                      :False
[2022-03-25 16:40:38,433] [    INFO] - eval_accumulation_steps       :None
[2022-03-25 16:40:38,433] [    INFO] - eval_batch_size               :128
[2022-03-25 16:40:38,433] [    INFO] - eval_steps                    :100
[2022-03-25 16:40:38,434] [    INFO] - evaluation_strategy           :IntervalStrategy.STEPS
[2022-03-25 16:40:38,434] [    INFO] - fp16                          :True
[2022-03-25 16:40:38,434] [    INFO] - fp16_opt_level                :O2
[2022-03-25 16:40:38,434] [    INFO] - gradient_accumulation_steps   :1
[2022-03-25 16:40:38,434] [    INFO] - greater_is_better             :True
[2022-03-25 16:40:38,434] [    INFO] - ignore_data_skip              :False
[2022-03-25 16:40:38,434] [    INFO] - label_names                   :None
[2022-03-25 16:40:38,434] [    INFO] - learning_rate                 :5e-05
[2022-03-25 16:40:38,434] [    INFO] - load_best_model_at_end        :True
[2022-03-25 16:40:38,434] [    INFO] - local_process_index           :0
[2022-03-25 16:40:38,434] [    INFO] - local_rank                    :-1
[2022-03-25 16:40:38,434] [    INFO] - log_level                     :-1
[2022-03-25 16:40:38,434] [    INFO] - log_level_replica             :-1
[2022-03-25 16:40:38,434] [    INFO] - log_on_each_node              :True
[2022-03-25 16:40:38,435] [    INFO] - logging_dir                   :tmp/runs/Mar25_16-40-31_yq01-inf-hic-k8s-a100-aa24-0417.yq01.baidu.com
[2022-03-25 16:40:38,435] [    INFO] - logging_first_step            :False
[2022-03-25 16:40:38,435] [    INFO] - logging_steps                 :10
[2022-03-25 16:40:38,435] [    INFO] - logging_strategy              :IntervalStrategy.STEPS
[2022-03-25 16:40:38,435] [    INFO] - lr_scheduler_type             :SchedulerType.LINEAR
[2022-03-25 16:40:38,435] [    INFO] - max_grad_norm                 :1.0
[2022-03-25 16:40:38,435] [    INFO] - max_steps                     :-1
[2022-03-25 16:40:38,435] [    INFO] - metric_for_best_model         :eval_accuracy
[2022-03-25 16:40:38,435] [    INFO] - minimum_eval_times            :1
[2022-03-25 16:40:38,435] [    INFO] - no_cuda                       :False
[2022-03-25 16:40:38,435] [    INFO] - num_train_epochs              :8
[2022-03-25 16:40:38,435] [    INFO] - optim                         :OptimizerNames.ADAMW
[2022-03-25 16:40:38,435] [    INFO] - output_dir                    :tmp
[2022-03-25 16:40:38,435] [    INFO] - overwrite_output_dir          :False
[2022-03-25 16:40:38,436] [    INFO] - past_index                    :-1
[2022-03-25 16:40:38,436] [    INFO] - per_device_eval_batch_size    :128
[2022-03-25 16:40:38,436] [    INFO] - per_device_train_batch_size   :128
[2022-03-25 16:40:38,436] [    INFO] - prediction_loss_only          :False
[2022-03-25 16:40:38,436] [    INFO] - process_index                 :0
[2022-03-25 16:40:38,436] [    INFO] - report_to                     :None
[2022-03-25 16:40:38,436] [    INFO] - resume_from_checkpoint        :None
[2022-03-25 16:40:38,436] [    INFO] - run_name                      :tmp
[2022-03-25 16:40:38,436] [    INFO] - save_on_each_node             :False
[2022-03-25 16:40:38,436] [    INFO] - save_steps                    :100
[2022-03-25 16:40:38,436] [    INFO] - save_strategy                 :IntervalStrategy.STEPS
[2022-03-25 16:40:38,436] [    INFO] - save_total_limit              :None
[2022-03-25 16:40:38,436] [    INFO] - scale_loss                    :32768
[2022-03-25 16:40:38,436] [    INFO] - seed                          :42
[2022-03-25 16:40:38,437] [    INFO] - should_log                    :True
[2022-03-25 16:40:38,437] [    INFO] - should_save                   :True
[2022-03-25 16:40:38,437] [    INFO] - skip_memory_metrics           :True
[2022-03-25 16:40:38,437] [    INFO] - train_batch_size              :128
[2022-03-25 16:40:38,437] [    INFO] - warmup_ratio                  :0.0
[2022-03-25 16:40:38,437] [    INFO] - warmup_steps                  :100
[2022-03-25 16:40:38,437] [    INFO] - weight_decay                  :0.01
[2022-03-25 16:40:38,437] [    INFO] - world_size                    :1
[2022-03-25 16:40:38,437] [    INFO] - ============================================================
[2022-03-25 16:40:38,439] [    INFO] - ***** Running training *****
[2022-03-25 16:40:38,439] [    INFO] -   Num examples = 9600
[2022-03-25 16:40:38,439] [    INFO] -   Num Epochs = 8
[2022-03-25 16:40:38,439] [    INFO] -   Instantaneous batch size per device = 128
[2022-03-25 16:40:38,439] [    INFO] -   Total train batch size (w. parallel, distributed & accumulation) = 128
[2022-03-25 16:40:38,439] [    INFO] -   Gradient Accumulation steps = 1
[2022-03-25 16:40:38,439] [    INFO] -   Total optimization steps = 600
[2022-03-25 16:40:38,439] [    INFO] -   Total num train samples = 76800
{'loss': 0.7314, 'learning_rate': 5e-06, 'global_step': 10, 'epoch': 0.13}                                                                                                   
{'loss': 0.689, 'learning_rate': 1e-05, 'global_step': 20, 'epoch': 0.27}                                                                                                    
{'loss': 0.6192, 'learning_rate': 1.5e-05, 'global_step': 30, 'epoch': 0.4}                                                                                                  
{'loss': 0.4854, 'learning_rate': 2e-05, 'global_step': 40, 'epoch': 0.53}                                                                                                   
{'loss': 0.341, 'learning_rate': 2.5e-05, 'global_step': 50, 'epoch': 0.67}                                                                                                  
{'loss': 0.2627, 'learning_rate': 3e-05, 'global_step': 60, 'epoch': 0.8}                                                                                                    
{'loss': 0.2923, 'learning_rate': 3.5e-05, 'global_step': 70, 'epoch': 0.93}                                                                                                 
{'loss': 0.2622, 'learning_rate': 4e-05, 'global_step': 80, 'epoch': 1.07}                                                                                                   
{'loss': 0.1903, 'learning_rate': 4.5e-05, 'global_step': 90, 'epoch': 1.2}                                                                                                  
{'loss': 0.2209, 'learning_rate': 5e-05, 'global_step': 100, 'epoch': 1.33}                                                                                                  
 17%|██████████████████████▎                                                                                                               | 100/600 [00:32<02:30,  3.32it/s][2022-03-25 16:41:11,192] [    INFO] - ***** Running Evaluation *****
[2022-03-25 16:41:11,193] [    INFO] -   Num examples = 1200
[2022-03-25 16:41:11,193] [    INFO] -   Pre device batch size = 128
[2022-03-25 16:41:11,193] [    INFO] -   Total Batch size = 128
[2022-03-25 16:41:11,193] [    INFO] -   Total prediction steps = 10
{'eval_loss': 0.2228255271911621, 'eval_accuracy': 0.9216666666666666, 'eval_runtime': 1.6054, 'eval_samples_per_second': 747.496, 'eval_steps_per_second': 6.229, 'epoch': 1.33}                                                                                                                                                                         
 17%|██████████████████████▎                                                                                                               | 100/600 [00:34<02:30,  3.32it/s[2022-03-25 16:41:12,798] [    INFO] - Saving model checkpoint to tmp/checkpoint-100                                                                                          
{'loss': 0.1989, 'learning_rate': 4.9e-05, 'global_step': 110, 'epoch': 1.47}                                                                                                
{'loss': 0.2085, 'learning_rate': 4.8e-05, 'global_step': 120, 'epoch': 1.6}                                                                                                 
{'loss': 0.1784, 'learning_rate': 4.7e-05, 'global_step': 130, 'epoch': 1.73}                                                                                                
{'loss': 0.1926, 'learning_rate': 4.600000000000001e-05, 'global_step': 140, 'epoch': 1.87}                                                                                  
{'loss': 0.1856, 'learning_rate': 4.5e-05, 'global_step': 150, 'epoch': 2.0}                                                                                                 
{'loss': 0.1202, 'learning_rate': 4.4000000000000006e-05, 'global_step': 160, 'epoch': 2.13}                                                                                 
{'loss': 0.1087, 'learning_rate': 4.3e-05, 'global_step': 170, 'epoch': 2.27}                                                                                                
{'loss': 0.1077, 'learning_rate': 4.2e-05, 'global_step': 180, 'epoch': 2.4}                                                                                                 
{'loss': 0.1355, 'learning_rate': 4.1e-05, 'global_step': 190, 'epoch': 2.53}                                                                                                
{'loss': 0.1252, 'learning_rate': 4e-05, 'global_step': 200, 'epoch': 2.67}                                                                                                  
 33%|████████████████████████████████████████████▋                                                                                         | 200/600 [01:11<01:58,  3.37it/s][2022-03-25 16:41:50,060] [    INFO] - ***** Running Evaluation *****
[2022-03-25 16:41:50,060] [    INFO] -   Num examples = 1200
[2022-03-25 16:41:50,060] [    INFO] -   Pre device batch size = 128
[2022-03-25 16:41:50,060] [    INFO] -   Total Batch size = 128
[2022-03-25 16:41:50,060] [    INFO] -   Total prediction steps = 10
{'eval_loss': 0.1823502629995346, 'eval_accuracy': 0.9341666666666667, 'eval_runtime': 1.6864, 'eval_samples_per_second': 711.591, 'eval_steps_per_second': 5.93, 'epoch': 2.67}                                                                                                                                                                          
 33%|████████████████████████████████████████████▋                                                                                         | 200/600 [01:13<01:58,  3.37it/s[2022-03-25 16:41:51,747] [    INFO] - Saving model checkpoint to tmp/checkpoint-200                                                                                          
{'loss': 0.1086, 'learning_rate': 3.9000000000000006e-05, 'global_step': 210, 'epoch': 2.8}                                                                                  
{'loss': 0.0988, 'learning_rate': 3.8e-05, 'global_step': 220, 'epoch': 2.93}                                                                                                
{'loss': 0.0968, 'learning_rate': 3.7e-05, 'global_step': 230, 'epoch': 3.07}                                                                                                
{'loss': 0.0649, 'learning_rate': 3.6e-05, 'global_step': 240, 'epoch': 3.2}                                                                                                 
{'loss': 0.0448, 'learning_rate': 3.5e-05, 'global_step': 250, 'epoch': 3.33}                                                                                                
{'loss': 0.0595, 'learning_rate': 3.4000000000000007e-05, 'global_step': 260, 'epoch': 3.47}                                                                                 
{'loss': 0.0509, 'learning_rate': 3.3e-05, 'global_step': 270, 'epoch': 3.6}                                                                                                 
{'loss': 0.06, 'learning_rate': 3.2000000000000005e-05, 'global_step': 280, 'epoch': 3.73}                                                                                   
{'loss': 0.0702, 'learning_rate': 3.1e-05, 'global_step': 290, 'epoch': 3.87}                                                                                                
{'loss': 0.0549, 'learning_rate': 3e-05, 'global_step': 300, 'epoch': 4.0}                                                                                                   
 50%|███████████████████████████████████████████████████████████████████                                                                   | 300/600 [01:49<00:48,  6.17it/s][2022-03-25 16:42:28,050] [    INFO] - ***** Running Evaluation *****
[2022-03-25 16:42:28,050] [    INFO] -   Num examples = 1200
[2022-03-25 16:42:28,050] [    INFO] -   Pre device batch size = 128
[2022-03-25 16:42:28,050] [    INFO] -   Total Batch size = 128
[2022-03-25 16:42:28,050] [    INFO] -   Total prediction steps = 10
{'eval_loss': 0.1881168633699417, 'eval_accuracy': 0.94, 'eval_runtime': 1.6446, 'eval_samples_per_second': 729.682, 'eval_steps_per_second': 6.081, 'epoch': 4.0}           
 50%|███████████████████████████████████████████████████████████████████                                                                   | 300/600 [01:51<00:48,  6.17it/s[2022-03-25 16:42:29,696] [    INFO] - Saving model checkpoint to tmp/checkpoint-300                                                                                          
{'loss': 0.0435, 'learning_rate': 2.9e-05, 'global_step': 310, 'epoch': 4.13}                                                                                                
{'loss': 0.03, 'learning_rate': 2.8000000000000003e-05, 'global_step': 320, 'epoch': 4.27}                                                                                   
{'loss': 0.0304, 'learning_rate': 2.7000000000000002e-05, 'global_step': 330, 'epoch': 4.4}                                                                                  
{'loss': 0.0355, 'learning_rate': 2.6000000000000002e-05, 'global_step': 340, 'epoch': 4.53}                                                                                 
{'loss': 0.0345, 'learning_rate': 2.5e-05, 'global_step': 350, 'epoch': 4.67}                                                                                                
{'loss': 0.0266, 'learning_rate': 2.4e-05, 'global_step': 360, 'epoch': 4.8}                                                                                                 
{'loss': 0.0249, 'learning_rate': 2.3000000000000003e-05, 'global_step': 370, 'epoch': 4.93}                                                                                 
{'loss': 0.0237, 'learning_rate': 2.2000000000000003e-05, 'global_step': 380, 'epoch': 5.07}                                                                                 
{'loss': 0.0187, 'learning_rate': 2.1e-05, 'global_step': 390, 'epoch': 5.2}                                                                                                 
{'loss': 0.0149, 'learning_rate': 2e-05, 'global_step': 400, 'epoch': 5.33}                                                                                                  
 67%|█████████████████████████████████████████████████████████████████████████████████████████▎                                            | 400/600 [02:29<00:57,  3.51it/s][2022-03-25 16:43:07,458] [    INFO] - ***** Running Evaluation *****
[2022-03-25 16:43:07,459] [    INFO] -   Num examples = 1200
[2022-03-25 16:43:07,459] [    INFO] -   Pre device batch size = 128
[2022-03-25 16:43:07,459] [    INFO] -   Total Batch size = 128
[2022-03-25 16:43:07,459] [    INFO] -   Total prediction steps = 10
{'eval_loss': 0.25193360447883606, 'eval_accuracy': 0.9425, 'eval_runtime': 1.8026, 'eval_samples_per_second': 665.703, 'eval_steps_per_second': 5.548, 'epoch': 5.33}       
 67%|█████████████████████████████████████████████████████████████████████████████████████████▎                                            | 400/600 [02:30<00:57,  3.51it/s[2022-03-25 16:43:09,262] [    INFO] - Saving model checkpoint to tmp/checkpoint-400                                                                                          
{'loss': 0.0231, 'learning_rate': 1.9e-05, 'global_step': 410, 'epoch': 5.47}                                                                                                
{'loss': 0.0154, 'learning_rate': 1.8e-05, 'global_step': 420, 'epoch': 5.6}                                                                                                 
{'loss': 0.0279, 'learning_rate': 1.7000000000000003e-05, 'global_step': 430, 'epoch': 5.73}                                                                                 
{'loss': 0.0194, 'learning_rate': 1.6000000000000003e-05, 'global_step': 440, 'epoch': 5.87}                                                                                 
{'loss': 0.0147, 'learning_rate': 1.5e-05, 'global_step': 450, 'epoch': 6.0}                                                                                                 
{'loss': 0.0085, 'learning_rate': 1.4000000000000001e-05, 'global_step': 460, 'epoch': 6.13}                                                                                 
{'loss': 0.0097, 'learning_rate': 1.3000000000000001e-05, 'global_step': 470, 'epoch': 6.27}                                                                                 
{'loss': 0.0112, 'learning_rate': 1.2e-05, 'global_step': 480, 'epoch': 6.4}                                                                                                 
{'loss': 0.0187, 'learning_rate': 1.1000000000000001e-05, 'global_step': 490, 'epoch': 6.53}                                                                                 
{'loss': 0.0106, 'learning_rate': 1e-05, 'global_step': 500, 'epoch': 6.67}                                                                                                  
 83%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                      | 500/600 [03:08<00:30,  3.25it/s][2022-03-25 16:43:46,542] [    INFO] - ***** Running Evaluation *****
[2022-03-25 16:43:46,543] [    INFO] -   Num examples = 1200
[2022-03-25 16:43:46,543] [    INFO] -   Pre device batch size = 128
[2022-03-25 16:43:46,543] [    INFO] -   Total Batch size = 128
[2022-03-25 16:43:46,543] [    INFO] -   Total prediction steps = 10
{'eval_loss': 0.2666992247104645, 'eval_accuracy': 0.9475, 'eval_runtime': 1.6861, 'eval_samples_per_second': 711.688, 'eval_steps_per_second': 5.931, 'epoch': 6.67}        
 83%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                      | 500/600 [03:09<00:30,  3.25it/s[2022-03-25 16:43:48,230] [    INFO] - Saving model checkpoint to tmp/checkpoint-500                                                                                          
{'loss': 0.0181, 'learning_rate': 9e-06, 'global_step': 510, 'epoch': 6.8}                                                                                                   
{'loss': 0.015, 'learning_rate': 8.000000000000001e-06, 'global_step': 520, 'epoch': 6.93}                                                                                   
{'loss': 0.0064, 'learning_rate': 7.000000000000001e-06, 'global_step': 530, 'epoch': 7.07}                                                                                  
{'loss': 0.0128, 'learning_rate': 6e-06, 'global_step': 540, 'epoch': 7.2}                                                                                                   
{'loss': 0.0116, 'learning_rate': 5e-06, 'global_step': 550, 'epoch': 7.33}                                                                                                  
{'loss': 0.0089, 'learning_rate': 4.000000000000001e-06, 'global_step': 560, 'epoch': 7.47}                                                                                  
{'loss': 0.0096, 'learning_rate': 3e-06, 'global_step': 570, 'epoch': 7.6}                                                                                                   
{'loss': 0.0073, 'learning_rate': 2.0000000000000003e-06, 'global_step': 580, 'epoch': 7.73}                                                                                 
{'loss': 0.0149, 'learning_rate': 1.0000000000000002e-06, 'global_step': 590, 'epoch': 7.87}                                                                                 
{'loss': 0.006, 'learning_rate': 0.0, 'global_step': 600, 'epoch': 8.0}                                                                                                      
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 600/600 [03:46<00:00,  6.34it/s][2022-03-25 16:44:24,858] [    INFO] - ***** Running Evaluation *****
[2022-03-25 16:44:24,858] [    INFO] -   Num examples = 1200
[2022-03-25 16:44:24,858] [    INFO] -   Pre device batch size = 128
[2022-03-25 16:44:24,859] [    INFO] -   Total Batch size = 128
[2022-03-25 16:44:24,859] [    INFO] -   Total prediction steps = 10
{'eval_loss': 0.28148436546325684, 'eval_accuracy': 0.9475, 'eval_runtime': 1.4858, 'eval_samples_per_second': 807.654, 'eval_steps_per_second': 6.73, 'epoch': 8.0}         
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 600/600 [03:47<00:00,  6.34it/s[2022-03-25 16:44:26,345] [    INFO] - Saving model checkpoint to tmp/checkpoint-600                                                                                          
[2022-03-25 16:44:33,450] [    INFO] - 
Training completed. 

[2022-03-25 16:44:33,450] [    INFO] - Loading best model from tmp/checkpoint-500 (score: 0.9475).
{'train_runtime': 235.7523, 'train_samples_per_second': 325.766, 'train_steps_per_second': 2.545, 'train_loss': 0.11529181957244873, 'epoch': 8.0}                           
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 600/600 [03:55<00:00,  2.55it/s]
[2022-03-25 16:44:34,195] [    INFO] - Saving model checkpoint to tmp
***** train metrics *****
  epoch                    =        8.0
  train_loss               =     0.1153
  train_runtime            = 0:03:55.75
  train_samples_per_second =    325.766
  train_steps_per_second   =      2.545
[2022-03-25 16:44:36,504] [    INFO] - ***** Running Evaluation *****
[2022-03-25 16:44:36,505] [    INFO] -   Num examples = 1200
[2022-03-25 16:44:36,505] [    INFO] -   Pre device batch size = 128
[2022-03-25 16:44:36,505] [    INFO] -   Total Batch size = 128
[2022-03-25 16:44:36,505] [    INFO] -   Total prediction steps = 10
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 10.38it/s]
***** eval metrics *****
  epoch                   =        8.0
  eval_accuracy           =     0.9475
  eval_loss               =     0.2667
  eval_runtime            = 0:00:01.57
  eval_samples_per_second =     761.58
  eval_steps_per_second   =      6.347
[2022-03-25 16:44:38,080] [    INFO] - ***** Running Prediction *****
[2022-03-25 16:44:38,080] [    INFO] -   Num examples = 1200
[2022-03-25 16:44:38,080] [    INFO] -   Pre device batch size = 128
[2022-03-25 16:44:38,081] [    INFO] -   Total Batch size = 128
[2022-03-25 16:44:38,081] [    INFO] -   Total prediction steps = 10
 90%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎             | 9/10 [00:00<00:00, 12.09it/s]***** test metrics *****
  test_accuracy           = 0.9591666666666666
  test_loss               =             0.1813
  test_runtime            =         0:00:01.65
  test_samples_per_second =             726.53
  test_steps_per_second   =              6.054
[2022-03-25 16:44:39,732] [    INFO] - Loading best model from tmp/checkpoint-500 (score: 0.9475).
[2022-03-25 16:44:40,530] [    INFO] - Exporting inference model to tmp/inference/infer
[2022-03-25 16:44:48,950] [    INFO] - Inference model exported.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:10<00:00,  1.02s/it]
(base) root@yq01-inf-hic-k8s-a100-aa24-0417.yq01.baidu.com finetune $ 

Changelog

  • Updated Mar 22: initial support for fp16 training (O1 and O2 levels)
  • Updated Mar 24: support multi-GPU training, multi-GPU evaluation, and gradient accumulation
  • Updated Mar 25: support prediction and exporting the inference model

TODO:

  • Verify correctness of multi-GPU evaluation
  • Verify correctness of QA task results
  • Verify training quality

@ZeyuChen (Member) commented: The coding-style CI did not pass.

wawltor previously approved these changes Mar 25, 2022

@wawltor (Collaborator) left a comment: LGTM

@ZHUI ZHUI requested a review from wawltor March 28, 2022 04:52
datasets
tqdm
Reviewer (Collaborator): Does tqdm really have to be listed in requirements?

Author (Collaborator): I'd suggest keeping it, since tqdm is imported by default. In fact datasets already depends on tqdm, so strictly speaking the entry could be omitted.

from paddlenlp.metrics.squad import squad_evaluate, compute_prediction

from paddlenlp.trainer.trainer_base import TrainerBase
from paddlenlp.utils.log import logger
Reviewer (Collaborator): I see this adds a new logging module to the library. Do we really need a new logging module, or should the existing logger module be optimized and upgraded instead?

Author (Collaborator): The new logging module has been removed. The trainer mainly needed capabilities such as log-level control and redirecting output to files; the existing logger can be upgraded with these later.



@dataclass
class DataTrainingArguments:
Reviewer (Collaborator): These argument definitions could be moved into the paddlenlp library itself, which would unify common settings; for example, max_seq_length could then be used directly in every task.

Author (Collaborator): That approach is feasible. But putting them in the library has one problem: users would lose an example of defining custom arguments, so when they need to add their own parameters there would be no guidance. A sketch of the in-script pattern follows.
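
For illustration, a hedged sketch of that in-script pattern; the field names are examples, not the PR's exact definitions:

# Hedged sketch: illustrative fields only; the actual script defines its own.
from dataclasses import dataclass, field

@dataclass
class DataTrainingArguments:
    dataset: str = field(
        default="chnsenticorp_v2",
        metadata={"help": "Name of the dataset to finetune on."})
    max_seq_length: int = field(
        default=128,
        metadata={"help": "Maximum input sequence length after tokenization."})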

@@ -0,0 +1,171 @@
# Copyright 2020-present the HuggingFace Inc. team.
Reviewer (Collaborator): One question here: why does this example carry the HuggingFace copyright?

Author (Collaborator): Removed.

return batchify_fn


class Dict(object):
Reviewer (Collaborator): paddlenlp already has a Dict class; what is the difference here?

Author (Collaborator): The Dict here returns a dict, whereas the original one in paddlenlp returns a list. The sketch below illustrates the difference.
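
A minimal, hedged illustration of that difference, using simplified stand-ins rather than the actual classes:

import numpy as np

batch = [{"input_ids": [1, 2, 3], "labels": 0},
         {"input_ids": [4, 5, 6], "labels": 1}]
fn_map = {"input_ids": np.array, "labels": np.array}

# Existing paddlenlp.data.Dict style: collated fields come back as a list,
# so downstream code must track field order positionally.
as_list = [fn([sample[k] for sample in batch]) for k, fn in fn_map.items()]

# New Dict style in this PR: fields keep their names, so the batch can be
# passed to the model as keyword arguments.
as_dict = {k: fn([sample[k] for sample in batch]) for k, fn in fn_map.items()}

print(as_list)  # [array([[1, 2, 3], [4, 5, 6]]), array([0, 1])]
print(as_dict)  # {'input_ids': array(...), 'labels': array([0, 1])}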

paddle.static.InputSpec(
shape=[None, None], dtype="int64") # segment_ids
]
trainer.export_model(input_spec=input_spec, load_best_model=True)
Reviewer (Collaborator): The exported model path should be user-configurable.

Author (Collaborator): Added. A hedged sketch of the resulting call is below.
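
A hedged sketch of the resulting call, reusing the `trainer` from the sketch near the top; the keyword name `output_dir` is an assumption for the user-defined path, and the merged parameter name may differ:

import paddle

# Trace signature for static-graph export: dynamic batch and sequence dims.
input_spec = [
    paddle.static.InputSpec(shape=[None, None], dtype="int64"),  # input_ids
    paddle.static.InputSpec(shape=[None, None], dtype="int64"),  # segment_ids
]
# `output_dir` is a hypothetical keyword for the user-defined export
# destination this review asked for.
trainer.export_model(input_spec=input_spec, load_best_model=True,
                     output_dir="tmp/inference")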

@ZHUI ZHUI changed the title [PRETRAIN] Finetune for all tasks. [Train] PaddleNLP trainer and finetune ernie-1.0 pretrain. Mar 29, 2022
@ZHUI ZHUI changed the title [Train] PaddleNLP trainer and finetune ernie-1.0 pretrain. [Trainer] PaddleNLP trainer and finetune ernie-1.0 pretrain. Mar 29, 2022
@wawltor (Collaborator) left a comment: LGTM

@ZHUI ZHUI merged commit 44a290e into PaddlePaddle:develop Mar 30, 2022
ZeyuChen added a commit to ZeyuChen/PaddleNLP that referenced this pull request Apr 17, 2022
…r ernie-1.0 pretraining. (PaddlePaddle#1761)

* add some datasets for finetune.

* support fine tune for all tasks.

* add trainer prototype.

* init version for paddlenlp trainer.

* refine trainer.

* update for some details.

* support multi-cards training evaluation.

* support load from ckpt.

* support for export inference model.

* first version of trainer.

* seq cls support clue.

* trainer support for token classification and question answering tasks.

* fix as reviews.

Co-authored-by: Zeyu Chen <chenzeyu01@baidu.com>