
[Trainer] PaddleNLP trainer and finetune ernie-1.0 pretrain. #1761

Merged
merged 19 commits into PaddlePaddle:develop on Mar 30, 2022

Conversation

@ZHUI (Collaborator) commented Mar 11, 2022

PR types

New features

PR changes

Others

Description

Finetune for all tasks.

Example usage:

  1. Finetune with PaddleNLP's built-in model weights:
python finetune.py  --dataset cmrc2018  --model_name_or_path ernie-1.0
  2. Finetune from a pretraining checkpoint:
python finetune.py  --dataset "clue ocnli"  --model_name_or_path ./output/ernie-dygraph_clue14g-dp8-gb512/model_500000/
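
For orientation, a minimal sketch of what a script such as run_seq_cls.py wires together with the new trainer. Only the TrainerBase import matches the diff quoted in the review threads below; the TrainingArguments import path and the constructor keywords are assumptions based on the HF-style flags printed in the logs, not this PR's verbatim code:

# Hedged sketch, not the PR's exact code. `TrainerBase` is confirmed by the
# diff below; the TrainingArguments location and constructor keywords are
# assumed from the flags shown in the training logs.
from paddlenlp.datasets import load_dataset
from paddlenlp.transformers import ErnieForSequenceClassification, ErnieTokenizer
from paddlenlp.trainer.trainer_base import TrainerBase
from paddlenlp.trainer.training_args import TrainingArguments  # assumed path

train_ds, dev_ds = load_dataset("chnsenticorp", splits=["train", "dev"])
model = ErnieForSequenceClassification.from_pretrained("ernie-1.0", num_classes=2)
tokenizer = ErnieTokenizer.from_pretrained("ernie-1.0")

training_args = TrainingArguments(  # field names taken from the log below
    output_dir="tmp",
    num_train_epochs=8,
    per_device_train_batch_size=128)

trainer = TrainerBase(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=dev_ds,
    tokenizer=tokenizer)
trainer.train()
trainer.evaluate()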

Trainer logs

(base) root@yq01-inf-hic-k8s-a100-aa24-0417.yq01.baidu.com finetune $ PYTHONPATH=../../../../ python run_seq_cls.py  --dataset chnsenticorp_v2  --model_name_or_path ernie-1.0 --fp16 true --fp16_opt_level O2 --output_dir tmp --gradient_accumulation_steps 1 --logging_steps 10 --eval_steps 50 --metric_for_best_model eval_accuracy  --load_best_model_at_end true
/nfs/zhonghui03/anaconda3/lib/python3.8/site-packages/setuptools/distutils_patch.py:25: UserWarning: Distutils was imported before Setuptools. This usage is discouraged and may exhibit undesirable behaviors or errors. Please use Setuptools' objects directly or at least import Setuptools first.
  warnings.warn(
[2022-03-25 16:40:31,350] [ WARNING] - Process rank: -1, device: gpu:0, world_size: 1, distributed training: False, 16-bits training: True
[2022-03-25 16:40:31,404] [    INFO] - Already cached /root/.paddlenlp/models/ernie-1.0/vocab.txt
[2022-03-25 16:40:31,415] [    INFO] - Already cached /root/.paddlenlp/models/ernie-1.0/ernie_v1_chn_base.pdparams
W0325 16:40:31.416553 105348 gpu_context.cc:240] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 11.4, Runtime API Version: 11.2
W0325 16:40:31.419184 105348 gpu_context.cc:268] device: 0, cuDNN Version: 8.0.
W0325 16:40:33.843657 105348 gpu_context.cc:449] WARNING: device: . The installed Paddle is compiled with CUDNN 8.1, but CUDNN version in your machine is 8.0, which may cause serious incompatible bug. Please recompile or reinstall Paddle with compatible CUDNN version.
[2022-03-25 16:40:38,432] [    INFO] - Using  half precision
[2022-03-25 16:40:38,432] [    INFO] - ============================================================
[2022-03-25 16:40:38,432] [    INFO] -         Configuration Arguments         
[2022-03-25 16:40:38,432] [    INFO] - paddle commit id              :9c2cee1c6c59ac440d4abd3d7ece8a5f8a140bd8
[2022-03-25 16:40:38,432] [    INFO] - adam_beta1                    :0.9
[2022-03-25 16:40:38,433] [    INFO] - adam_beta2                    :0.999
[2022-03-25 16:40:38,433] [    INFO] - adam_epsilon                  :1e-08
[2022-03-25 16:40:38,433] [    INFO] - dataloader_drop_last          :False
[2022-03-25 16:40:38,433] [    INFO] - dataloader_num_workers        :0
[2022-03-25 16:40:38,433] [    INFO] - debug                         :[]
[2022-03-25 16:40:38,433] [    INFO] - device                        :gpu:0
[2022-03-25 16:40:38,433] [    INFO] - disable_tqdm                  :False
[2022-03-25 16:40:38,433] [    INFO] - do_eval                       :True
[2022-03-25 16:40:38,433] [    INFO] - do_predict                    :False
[2022-03-25 16:40:38,433] [    INFO] - do_train                      :False
[2022-03-25 16:40:38,433] [    INFO] - eval_accumulation_steps       :None
[2022-03-25 16:40:38,433] [    INFO] - eval_batch_size               :128
[2022-03-25 16:40:38,433] [    INFO] - eval_steps                    :100
[2022-03-25 16:40:38,434] [    INFO] - evaluation_strategy           :IntervalStrategy.STEPS
[2022-03-25 16:40:38,434] [    INFO] - fp16                          :True
[2022-03-25 16:40:38,434] [    INFO] - fp16_opt_level                :O2
[2022-03-25 16:40:38,434] [    INFO] - gradient_accumulation_steps   :1
[2022-03-25 16:40:38,434] [    INFO] - greater_is_better             :True
[2022-03-25 16:40:38,434] [    INFO] - ignore_data_skip              :False
[2022-03-25 16:40:38,434] [    INFO] - label_names                   :None
[2022-03-25 16:40:38,434] [    INFO] - learning_rate                 :5e-05
[2022-03-25 16:40:38,434] [    INFO] - load_best_model_at_end        :True
[2022-03-25 16:40:38,434] [    INFO] - local_process_index           :0
[2022-03-25 16:40:38,434] [    INFO] - local_rank                    :-1
[2022-03-25 16:40:38,434] [    INFO] - log_level                     :-1
[2022-03-25 16:40:38,434] [    INFO] - log_level_replica             :-1
[2022-03-25 16:40:38,434] [    INFO] - log_on_each_node              :True
[2022-03-25 16:40:38,435] [    INFO] - logging_dir                   :tmp/runs/Mar25_16-40-31_yq01-inf-hic-k8s-a100-aa24-0417.yq01.baidu.com
[2022-03-25 16:40:38,435] [    INFO] - logging_first_step            :False
[2022-03-25 16:40:38,435] [    INFO] - logging_steps                 :10
[2022-03-25 16:40:38,435] [    INFO] - logging_strategy              :IntervalStrategy.STEPS
[2022-03-25 16:40:38,435] [    INFO] - lr_scheduler_type             :SchedulerType.LINEAR
[2022-03-25 16:40:38,435] [    INFO] - max_grad_norm                 :1.0
[2022-03-25 16:40:38,435] [    INFO] - max_steps                     :-1
[2022-03-25 16:40:38,435] [    INFO] - metric_for_best_model         :eval_accuracy
[2022-03-25 16:40:38,435] [    INFO] - minimum_eval_times            :1
[2022-03-25 16:40:38,435] [    INFO] - no_cuda                       :False
[2022-03-25 16:40:38,435] [    INFO] - num_train_epochs              :8
[2022-03-25 16:40:38,435] [    INFO] - optim                         :OptimizerNames.ADAMW
[2022-03-25 16:40:38,435] [    INFO] - output_dir                    :tmp
[2022-03-25 16:40:38,435] [    INFO] - overwrite_output_dir          :False
[2022-03-25 16:40:38,436] [    INFO] - past_index                    :-1
[2022-03-25 16:40:38,436] [    INFO] - per_device_eval_batch_size    :128
[2022-03-25 16:40:38,436] [    INFO] - per_device_train_batch_size   :128
[2022-03-25 16:40:38,436] [    INFO] - prediction_loss_only          :False
[2022-03-25 16:40:38,436] [    INFO] - process_index                 :0
[2022-03-25 16:40:38,436] [    INFO] - report_to                     :None
[2022-03-25 16:40:38,436] [    INFO] - resume_from_checkpoint        :None
[2022-03-25 16:40:38,436] [    INFO] - run_name                      :tmp
[2022-03-25 16:40:38,436] [    INFO] - save_on_each_node             :False
[2022-03-25 16:40:38,436] [    INFO] - save_steps                    :100
[2022-03-25 16:40:38,436] [    INFO] - save_strategy                 :IntervalStrategy.STEPS
[2022-03-25 16:40:38,436] [    INFO] - save_total_limit              :None
[2022-03-25 16:40:38,436] [    INFO] - scale_loss                    :32768
[2022-03-25 16:40:38,436] [    INFO] - seed                          :42
[2022-03-25 16:40:38,437] [    INFO] - should_log                    :True
[2022-03-25 16:40:38,437] [    INFO] - should_save                   :True
[2022-03-25 16:40:38,437] [    INFO] - skip_memory_metrics           :True
[2022-03-25 16:40:38,437] [    INFO] - train_batch_size              :128
[2022-03-25 16:40:38,437] [    INFO] - warmup_ratio                  :0.0
[2022-03-25 16:40:38,437] [    INFO] - warmup_steps                  :100
[2022-03-25 16:40:38,437] [    INFO] - weight_decay                  :0.01
[2022-03-25 16:40:38,437] [    INFO] - world_size                    :1
[2022-03-25 16:40:38,437] [    INFO] - ============================================================
[2022-03-25 16:40:38,439] [    INFO] - ***** Running training *****
[2022-03-25 16:40:38,439] [    INFO] -   Num examples = 9600
[2022-03-25 16:40:38,439] [    INFO] -   Num Epochs = 8
[2022-03-25 16:40:38,439] [    INFO] -   Instantaneous batch size per device = 128
[2022-03-25 16:40:38,439] [    INFO] -   Total train batch size (w. parallel, distributed & accumulation) = 128
[2022-03-25 16:40:38,439] [    INFO] -   Gradient Accumulation steps = 1
[2022-03-25 16:40:38,439] [    INFO] -   Total optimization steps = 600
[2022-03-25 16:40:38,439] [    INFO] -   Total num train samples = 76800
{'loss': 0.7314, 'learning_rate': 5e-06, 'global_step': 10, 'epoch': 0.13}                                                                                                   
{'loss': 0.689, 'learning_rate': 1e-05, 'global_step': 20, 'epoch': 0.27}                                                                                                    
{'loss': 0.6192, 'learning_rate': 1.5e-05, 'global_step': 30, 'epoch': 0.4}                                                                                                  
{'loss': 0.4854, 'learning_rate': 2e-05, 'global_step': 40, 'epoch': 0.53}                                                                                                   
{'loss': 0.341, 'learning_rate': 2.5e-05, 'global_step': 50, 'epoch': 0.67}                                                                                                  
{'loss': 0.2627, 'learning_rate': 3e-05, 'global_step': 60, 'epoch': 0.8}                                                                                                    
{'loss': 0.2923, 'learning_rate': 3.5e-05, 'global_step': 70, 'epoch': 0.93}                                                                                                 
{'loss': 0.2622, 'learning_rate': 4e-05, 'global_step': 80, 'epoch': 1.07}                                                                                                   
{'loss': 0.1903, 'learning_rate': 4.5e-05, 'global_step': 90, 'epoch': 1.2}                                                                                                  
{'loss': 0.2209, 'learning_rate': 5e-05, 'global_step': 100, 'epoch': 1.33}                                                                                                  
 17%|██████████████████████▎                                                                                                               | 100/600 [00:32<02:30,  3.32it/s][2022-03-25 16:41:11,192] [    INFO] - ***** Running Evaluation *****
[2022-03-25 16:41:11,193] [    INFO] -   Num examples = 1200
[2022-03-25 16:41:11,193] [    INFO] -   Pre device batch size = 128
[2022-03-25 16:41:11,193] [    INFO] -   Total Batch size = 128
[2022-03-25 16:41:11,193] [    INFO] -   Total prediction steps = 10
{'eval_loss': 0.2228255271911621, 'eval_accuracy': 0.9216666666666666, 'eval_runtime': 1.6054, 'eval_samples_per_second': 747.496, 'eval_steps_per_second': 6.229, 'epoch': 1.33}                                                                                                                                                                         
 17%|██████████████████████▎                                                                                                               | 100/600 [00:34<02:30,  3.32it/s[2022-03-25 16:41:12,798] [    INFO] - Saving model checkpoint to tmp/checkpoint-100                                                                                          
{'loss': 0.1989, 'learning_rate': 4.9e-05, 'global_step': 110, 'epoch': 1.47}                                                                                                
{'loss': 0.2085, 'learning_rate': 4.8e-05, 'global_step': 120, 'epoch': 1.6}                                                                                                 
{'loss': 0.1784, 'learning_rate': 4.7e-05, 'global_step': 130, 'epoch': 1.73}                                                                                                
{'loss': 0.1926, 'learning_rate': 4.600000000000001e-05, 'global_step': 140, 'epoch': 1.87}                                                                                  
{'loss': 0.1856, 'learning_rate': 4.5e-05, 'global_step': 150, 'epoch': 2.0}                                                                                                 
{'loss': 0.1202, 'learning_rate': 4.4000000000000006e-05, 'global_step': 160, 'epoch': 2.13}                                                                                 
{'loss': 0.1087, 'learning_rate': 4.3e-05, 'global_step': 170, 'epoch': 2.27}                                                                                                
{'loss': 0.1077, 'learning_rate': 4.2e-05, 'global_step': 180, 'epoch': 2.4}                                                                                                 
{'loss': 0.1355, 'learning_rate': 4.1e-05, 'global_step': 190, 'epoch': 2.53}                                                                                                
{'loss': 0.1252, 'learning_rate': 4e-05, 'global_step': 200, 'epoch': 2.67}                                                                                                  
 33%|████████████████████████████████████████████▋                                                                                         | 200/600 [01:11<01:58,  3.37it/s][2022-03-25 16:41:50,060] [    INFO] - ***** Running Evaluation *****
[2022-03-25 16:41:50,060] [    INFO] -   Num examples = 1200
[2022-03-25 16:41:50,060] [    INFO] -   Pre device batch size = 128
[2022-03-25 16:41:50,060] [    INFO] -   Total Batch size = 128
[2022-03-25 16:41:50,060] [    INFO] -   Total prediction steps = 10
{'eval_loss': 0.1823502629995346, 'eval_accuracy': 0.9341666666666667, 'eval_runtime': 1.6864, 'eval_samples_per_second': 711.591, 'eval_steps_per_second': 5.93, 'epoch': 2.67}                                                                                                                                                                          
 33%|████████████████████████████████████████████▋                                                                                         | 200/600 [01:13<01:58,  3.37it/s[2022-03-25 16:41:51,747] [    INFO] - Saving model checkpoint to tmp/checkpoint-200                                                                                          
{'loss': 0.1086, 'learning_rate': 3.9000000000000006e-05, 'global_step': 210, 'epoch': 2.8}                                                                                  
{'loss': 0.0988, 'learning_rate': 3.8e-05, 'global_step': 220, 'epoch': 2.93}                                                                                                
{'loss': 0.0968, 'learning_rate': 3.7e-05, 'global_step': 230, 'epoch': 3.07}                                                                                                
{'loss': 0.0649, 'learning_rate': 3.6e-05, 'global_step': 240, 'epoch': 3.2}                                                                                                 
{'loss': 0.0448, 'learning_rate': 3.5e-05, 'global_step': 250, 'epoch': 3.33}                                                                                                
{'loss': 0.0595, 'learning_rate': 3.4000000000000007e-05, 'global_step': 260, 'epoch': 3.47}                                                                                 
{'loss': 0.0509, 'learning_rate': 3.3e-05, 'global_step': 270, 'epoch': 3.6}                                                                                                 
{'loss': 0.06, 'learning_rate': 3.2000000000000005e-05, 'global_step': 280, 'epoch': 3.73}                                                                                   
{'loss': 0.0702, 'learning_rate': 3.1e-05, 'global_step': 290, 'epoch': 3.87}                                                                                                
{'loss': 0.0549, 'learning_rate': 3e-05, 'global_step': 300, 'epoch': 4.0}                                                                                                   
 50%|███████████████████████████████████████████████████████████████████                                                                   | 300/600 [01:49<00:48,  6.17it/s][2022-03-25 16:42:28,050] [    INFO] - ***** Running Evaluation *****
[2022-03-25 16:42:28,050] [    INFO] -   Num examples = 1200
[2022-03-25 16:42:28,050] [    INFO] -   Pre device batch size = 128
[2022-03-25 16:42:28,050] [    INFO] -   Total Batch size = 128
[2022-03-25 16:42:28,050] [    INFO] -   Total prediction steps = 10
{'eval_loss': 0.1881168633699417, 'eval_accuracy': 0.94, 'eval_runtime': 1.6446, 'eval_samples_per_second': 729.682, 'eval_steps_per_second': 6.081, 'epoch': 4.0}           
 50%|███████████████████████████████████████████████████████████████████                                                                   | 300/600 [01:51<00:48,  6.17it/s[2022-03-25 16:42:29,696] [    INFO] - Saving model checkpoint to tmp/checkpoint-300                                                                                          
{'loss': 0.0435, 'learning_rate': 2.9e-05, 'global_step': 310, 'epoch': 4.13}                                                                                                
{'loss': 0.03, 'learning_rate': 2.8000000000000003e-05, 'global_step': 320, 'epoch': 4.27}                                                                                   
{'loss': 0.0304, 'learning_rate': 2.7000000000000002e-05, 'global_step': 330, 'epoch': 4.4}                                                                                  
{'loss': 0.0355, 'learning_rate': 2.6000000000000002e-05, 'global_step': 340, 'epoch': 4.53}                                                                                 
{'loss': 0.0345, 'learning_rate': 2.5e-05, 'global_step': 350, 'epoch': 4.67}                                                                                                
{'loss': 0.0266, 'learning_rate': 2.4e-05, 'global_step': 360, 'epoch': 4.8}                                                                                                 
{'loss': 0.0249, 'learning_rate': 2.3000000000000003e-05, 'global_step': 370, 'epoch': 4.93}                                                                                 
{'loss': 0.0237, 'learning_rate': 2.2000000000000003e-05, 'global_step': 380, 'epoch': 5.07}                                                                                 
{'loss': 0.0187, 'learning_rate': 2.1e-05, 'global_step': 390, 'epoch': 5.2}                                                                                                 
{'loss': 0.0149, 'learning_rate': 2e-05, 'global_step': 400, 'epoch': 5.33}                                                                                                  
 67%|█████████████████████████████████████████████████████████████████████████████████████████▎                                            | 400/600 [02:29<00:57,  3.51it/s][2022-03-25 16:43:07,458] [    INFO] - ***** Running Evaluation *****
[2022-03-25 16:43:07,459] [    INFO] -   Num examples = 1200
[2022-03-25 16:43:07,459] [    INFO] -   Pre device batch size = 128
[2022-03-25 16:43:07,459] [    INFO] -   Total Batch size = 128
[2022-03-25 16:43:07,459] [    INFO] -   Total prediction steps = 10
{'eval_loss': 0.25193360447883606, 'eval_accuracy': 0.9425, 'eval_runtime': 1.8026, 'eval_samples_per_second': 665.703, 'eval_steps_per_second': 5.548, 'epoch': 5.33}       
 67%|█████████████████████████████████████████████████████████████████████████████████████████▎                                            | 400/600 [02:30<00:57,  3.51it/s[2022-03-25 16:43:09,262] [    INFO] - Saving model checkpoint to tmp/checkpoint-400                                                                                          
{'loss': 0.0231, 'learning_rate': 1.9e-05, 'global_step': 410, 'epoch': 5.47}                                                                                                
{'loss': 0.0154, 'learning_rate': 1.8e-05, 'global_step': 420, 'epoch': 5.6}                                                                                                 
{'loss': 0.0279, 'learning_rate': 1.7000000000000003e-05, 'global_step': 430, 'epoch': 5.73}                                                                                 
{'loss': 0.0194, 'learning_rate': 1.6000000000000003e-05, 'global_step': 440, 'epoch': 5.87}                                                                                 
{'loss': 0.0147, 'learning_rate': 1.5e-05, 'global_step': 450, 'epoch': 6.0}                                                                                                 
{'loss': 0.0085, 'learning_rate': 1.4000000000000001e-05, 'global_step': 460, 'epoch': 6.13}                                                                                 
{'loss': 0.0097, 'learning_rate': 1.3000000000000001e-05, 'global_step': 470, 'epoch': 6.27}                                                                                 
{'loss': 0.0112, 'learning_rate': 1.2e-05, 'global_step': 480, 'epoch': 6.4}                                                                                                 
{'loss': 0.0187, 'learning_rate': 1.1000000000000001e-05, 'global_step': 490, 'epoch': 6.53}                                                                                 
{'loss': 0.0106, 'learning_rate': 1e-05, 'global_step': 500, 'epoch': 6.67}                                                                                                  
 83%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                      | 500/600 [03:08<00:30,  3.25it/s][2022-03-25 16:43:46,542] [    INFO] - ***** Running Evaluation *****
[2022-03-25 16:43:46,543] [    INFO] -   Num examples = 1200
[2022-03-25 16:43:46,543] [    INFO] -   Pre device batch size = 128
[2022-03-25 16:43:46,543] [    INFO] -   Total Batch size = 128
[2022-03-25 16:43:46,543] [    INFO] -   Total prediction steps = 10
{'eval_loss': 0.2666992247104645, 'eval_accuracy': 0.9475, 'eval_runtime': 1.6861, 'eval_samples_per_second': 711.688, 'eval_steps_per_second': 5.931, 'epoch': 6.67}        
 83%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                      | 500/600 [03:09<00:30,  3.25it/s[2022-03-25 16:43:48,230] [    INFO] - Saving model checkpoint to tmp/checkpoint-500                                                                                          
{'loss': 0.0181, 'learning_rate': 9e-06, 'global_step': 510, 'epoch': 6.8}                                                                                                   
{'loss': 0.015, 'learning_rate': 8.000000000000001e-06, 'global_step': 520, 'epoch': 6.93}                                                                                   
{'loss': 0.0064, 'learning_rate': 7.000000000000001e-06, 'global_step': 530, 'epoch': 7.07}                                                                                  
{'loss': 0.0128, 'learning_rate': 6e-06, 'global_step': 540, 'epoch': 7.2}                                                                                                   
{'loss': 0.0116, 'learning_rate': 5e-06, 'global_step': 550, 'epoch': 7.33}                                                                                                  
{'loss': 0.0089, 'learning_rate': 4.000000000000001e-06, 'global_step': 560, 'epoch': 7.47}                                                                                  
{'loss': 0.0096, 'learning_rate': 3e-06, 'global_step': 570, 'epoch': 7.6}                                                                                                   
{'loss': 0.0073, 'learning_rate': 2.0000000000000003e-06, 'global_step': 580, 'epoch': 7.73}                                                                                 
{'loss': 0.0149, 'learning_rate': 1.0000000000000002e-06, 'global_step': 590, 'epoch': 7.87}                                                                                 
{'loss': 0.006, 'learning_rate': 0.0, 'global_step': 600, 'epoch': 8.0}                                                                                                      
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 600/600 [03:46<00:00,  6.34it/s][2022-03-25 16:44:24,858] [    INFO] - ***** Running Evaluation *****
[2022-03-25 16:44:24,858] [    INFO] -   Num examples = 1200
[2022-03-25 16:44:24,858] [    INFO] -   Pre device batch size = 128
[2022-03-25 16:44:24,859] [    INFO] -   Total Batch size = 128
[2022-03-25 16:44:24,859] [    INFO] -   Total prediction steps = 10
{'eval_loss': 0.28148436546325684, 'eval_accuracy': 0.9475, 'eval_runtime': 1.4858, 'eval_samples_per_second': 807.654, 'eval_steps_per_second': 6.73, 'epoch': 8.0}         
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 600/600 [03:47<00:00,  6.34it/s[2022-03-25 16:44:26,345] [    INFO] - Saving model checkpoint to tmp/checkpoint-600                                                                                          
[2022-03-25 16:44:33,450] [    INFO] - 
Training completed. 

[2022-03-25 16:44:33,450] [    INFO] - Loading best model from tmp/checkpoint-500 (score: 0.9475).
{'train_runtime': 235.7523, 'train_samples_per_second': 325.766, 'train_steps_per_second': 2.545, 'train_loss': 0.11529181957244873, 'epoch': 8.0}                           
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 600/600 [03:55<00:00,  2.55it/s]
[2022-03-25 16:44:34,195] [    INFO] - Saving model checkpoint to tmp
***** train metrics *****
  epoch                    =        8.0
  train_loss               =     0.1153
  train_runtime            = 0:03:55.75
  train_samples_per_second =    325.766
  train_steps_per_second   =      2.545
[2022-03-25 16:44:36,504] [    INFO] - ***** Running Evaluation *****
[2022-03-25 16:44:36,505] [    INFO] -   Num examples = 1200
[2022-03-25 16:44:36,505] [    INFO] -   Pre device batch size = 128
[2022-03-25 16:44:36,505] [    INFO] -   Total Batch size = 128
[2022-03-25 16:44:36,505] [    INFO] -   Total prediction steps = 10
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 10.38it/s]
***** eval metrics *****
  epoch                   =        8.0
  eval_accuracy           =     0.9475
  eval_loss               =     0.2667
  eval_runtime            = 0:00:01.57
  eval_samples_per_second =     761.58
  eval_steps_per_second   =      6.347
[2022-03-25 16:44:38,080] [    INFO] - ***** Running Prediction *****
[2022-03-25 16:44:38,080] [    INFO] -   Num examples = 1200
[2022-03-25 16:44:38,080] [    INFO] -   Pre device batch size = 128
[2022-03-25 16:44:38,081] [    INFO] -   Total Batch size = 128
[2022-03-25 16:44:38,081] [    INFO] -   Total prediction steps = 10
 90%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎             | 9/10 [00:00<00:00, 12.09it/s]***** test metrics *****
  test_accuracy           = 0.9591666666666666
  test_loss               =             0.1813
  test_runtime            =         0:00:01.65
  test_samples_per_second =             726.53
  test_steps_per_second   =              6.054
[2022-03-25 16:44:39,732] [    INFO] - Loading best model from tmp/checkpoint-500 (score: 0.9475).
[2022-03-25 16:44:40,530] [    INFO] - Exporting inference model to tmp/inference/infer
[2022-03-25 16:44:48,950] [    INFO] - Inference model exported.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:10<00:00,  1.02s/it]
(base) root@yq01-inf-hic-k8s-a100-aa24-0417.yq01.baidu.com finetune $ 

Changelog

  • Updated Mar 22: initial support for fp16 training (O1 and O2 levels)
  • Updated Mar 24: support multi-GPU training, multi-GPU evaluation, and gradient accumulation
  • Updated Mar 25: support prediction and exporting the inference model

TODO:

  • Verify correctness of multi-GPU evaluation
  • Verify correctness of QA task results
  • Verify training quality

@ZeyuChen (Member) commented: The coding-style CI did not pass.

wawltor previously approved these changes Mar 25, 2022

@wawltor (Collaborator) left a comment: LGTM

@ZHUI ZHUI requested a review from wawltor March 28, 2022 04:52
datasets
tqdm
Reviewer (Collaborator): Does tqdm really have to be listed in requirements?

Author (Collaborator): I'd suggest keeping it, since tqdm is imported by default. In fact datasets already depends on tqdm, so strictly speaking the entry could be omitted.

from paddlenlp.metrics.squad import squad_evaluate, compute_prediction

from paddlenlp.trainer.trainer_base import TrainerBase
from paddlenlp.utils.log import logger
Reviewer (Collaborator): I see this adds a new logging module to the library. Do we really need a new logging module, or should the existing logger module be optimized and upgraded instead?

Author (Collaborator): The new logging module has been removed. The trainer mainly needed capabilities such as log-level control and redirecting output to files; the existing logger can be upgraded with these later.



@dataclass
class DataTrainingArguments:
Reviewer (Collaborator): These argument definitions could be moved into the paddlenlp library itself, which would unify common settings; for example, max_seq_length could then be used directly in every task.

Author (Collaborator): That approach is feasible. But putting them in the library has one problem: users would lose an example of defining custom arguments, so when they need to add their own parameters there would be no guidance. A sketch of the in-script pattern follows.
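
For illustration, a hedged sketch of that in-script pattern; the field names are examples, not the PR's exact definitions:

# Hedged sketch: illustrative fields only; the actual script defines its own.
from dataclasses import dataclass, field

@dataclass
class DataTrainingArguments:
    dataset: str = field(
        default="chnsenticorp_v2",
        metadata={"help": "Name of the dataset to finetune on."})
    max_seq_length: int = field(
        default=128,
        metadata={"help": "Maximum input sequence length after tokenization."})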

@@ -0,0 +1,171 @@
# Copyright 2020-present the HuggingFace Inc. team.
Reviewer (Collaborator): One question here: why does this example carry the HuggingFace copyright?

Author (Collaborator): Removed.

return batchify_fn


class Dict(object):
Reviewer (Collaborator): paddlenlp already has a Dict class; what is the difference here?

Author (Collaborator): The Dict here returns a dict, whereas the original one in paddlenlp returns a list. The sketch below illustrates the difference.
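
A minimal, hedged illustration of that difference, using simplified stand-ins rather than the actual classes:

import numpy as np

batch = [{"input_ids": [1, 2, 3], "labels": 0},
         {"input_ids": [4, 5, 6], "labels": 1}]
fn_map = {"input_ids": np.array, "labels": np.array}

# Existing paddlenlp.data.Dict style: collated fields come back as a list,
# so downstream code must track field order positionally.
as_list = [fn([sample[k] for sample in batch]) for k, fn in fn_map.items()]

# New Dict style in this PR: fields keep their names, so the batch can be
# passed to the model as keyword arguments.
as_dict = {k: fn([sample[k] for sample in batch]) for k, fn in fn_map.items()}

print(as_list)  # [array([[1, 2, 3], [4, 5, 6]]), array([0, 1])]
print(as_dict)  # {'input_ids': array(...), 'labels': array([0, 1])}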

paddle.static.InputSpec(
shape=[None, None], dtype="int64") # segment_ids
]
trainer.export_model(input_spec=input_spec, load_best_model=True)
Reviewer (Collaborator): The exported model path should be user-configurable.

Author (Collaborator): Added. A hedged sketch of the resulting call is below.
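
A hedged sketch of the resulting call, reusing the `trainer` from the sketch near the top; the keyword name `output_dir` is an assumption for the user-defined path, and the merged parameter name may differ:

import paddle

# Trace signature for static-graph export: dynamic batch and sequence dims.
input_spec = [
    paddle.static.InputSpec(shape=[None, None], dtype="int64"),  # input_ids
    paddle.static.InputSpec(shape=[None, None], dtype="int64"),  # segment_ids
]
# `output_dir` is a hypothetical keyword for the user-defined export
# destination this review asked for.
trainer.export_model(input_spec=input_spec, load_best_model=True,
                     output_dir="tmp/inference")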

@ZHUI ZHUI changed the title [PRETRAIN] Finetune for all tasks. [Train] PaddleNLP trainer and finetune ernie-1.0 pretrain. Mar 29, 2022
@ZHUI ZHUI changed the title [Train] PaddleNLP trainer and finetune ernie-1.0 pretrain. [Trainer] PaddleNLP trainer and finetune ernie-1.0 pretrain. Mar 29, 2022
@wawltor (Collaborator) left a comment: LGTM

@ZHUI ZHUI merged commit 44a290e into PaddlePaddle:develop Mar 30, 2022
ZeyuChen added a commit to ZeyuChen/PaddleNLP that referenced this pull request Apr 17, 2022
…r ernie-1.0 pretraining. (PaddlePaddle#1761)

* add some datasets for finetune.

* support fine tune for all tasks.

* add trainer prototype.

* init version for paddlenlp trainer.

* refine trainer.

* update for some details.

* support multi-cards training evaluation.

* support load from ckpt.

* support for export inference model.

* first version of trainer.

* seq cls support clue.

* trainer support for token classification and question answering tasks.

* fix as reviews.

Co-authored-by: Zeyu Chen <chenzeyu01@baidu.com>