
[WIP] Accelerate training by replacing DataContainer object scatter #1236

Closed
wants to merge 3 commits

Conversation

hhaAndroid
Contributor

@hhaAndroid hhaAndroid commented Aug 2, 2021

Motivation

While reproducing YOLOX, we found that the scatter step for DataContainer objects significantly increases training time. In practice, replacing the custom scatter reduced training time by about half.

Modification

Replace the custom scatter of the DataContainer object with PyTorch's built-in scatter.
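For context, the custom scatter walks each DataContainer in the batch and hands each device a slice of its payload. A minimal CPU-only sketch of that idea (`MiniDataContainer` and `scatter_slices` are illustrative stand-ins, not mmcv's actual API):

```python
class MiniDataContainer:
    """Toy stand-in for mmcv.parallel.DataContainer."""

    def __init__(self, data, stack=False):
        self.data = data
        self.stack = stack


def scatter_slices(container, num_devices):
    """Split the container's list payload into one chunk per device.

    Mirrors the core idea of the custom scatter: each device receives
    a contiguous slice of the per-sample items.
    """
    items = container.data
    # Ceiling division so the last device absorbs any remainder.
    chunk = (len(items) + num_devices - 1) // num_devices
    return [items[i * chunk:(i + 1) * chunk] for i in range(num_devices)]


batch = MiniDataContainer(data=[f"sample_{i}" for i in range(8)])
chunks = scatter_slices(batch, num_devices=4)
# Each of the 4 "devices" receives 2 samples.
```

The real implementation in `mmcv/parallel/_functions.py` additionally moves tensor payloads to the target GPUs; the PR's point is that PyTorch's own scatter already covers this case more efficiently.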

BC-breaking

None

Use cases

I have only tested MMDetection; the other frameworks have not been tested yet.

Note

I still need to run experiments comparing training and inference speed.
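One simple pattern for that speed comparison: wrap each scatter variant in a zero-argument callable and time it over repeated calls. The `time_fn` helper below is illustrative, not part of mmcv:

```python
import time


def time_fn(fn, repeats=100):
    """Return the mean wall-clock seconds of `fn` over `repeats` calls."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


# Usage: compare two implementations of the same operation, e.g. the
# custom DataContainer scatter vs. PyTorch's scatter, each wrapped in
# a zero-argument callable that scatters the same batch.
baseline = time_fn(lambda: sorted(range(1000)))  # placeholder workload
```

For GPU code, remember to synchronize (`torch.cuda.synchronize()`) inside the callable before timing, otherwise asynchronous kernel launches make the numbers meaningless.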

@hhaAndroid hhaAndroid changed the title Speed up DataContainer object scatter Accelerate training by replacing DataContainer object scatter Aug 2, 2021
@codecov

codecov bot commented Aug 2, 2021

Codecov Report

Merging #1236 (586e394) into master (285a052) will increase coverage by 0.02%.
The diff coverage is 33.33%.


@@            Coverage Diff             @@
##           master    #1236      +/-   ##
==========================================
+ Coverage   68.27%   68.29%   +0.02%     
==========================================
  Files         160      160              
  Lines       10599    10597       -2     
  Branches     1937     1937              
==========================================
+ Hits         7236     7237       +1     
+ Misses       2979     2976       -3     
  Partials      384      384              
Flag Coverage Δ
unittests 68.29% <33.33%> (+0.02%) ⬆️


Impacted Files Coverage Δ
mmcv/parallel/_functions.py 16.32% <33.33%> (+2.60%) ⬆️


@zhouzaida zhouzaida requested a review from hellock August 4, 2021 12:41
@zhouzaida zhouzaida mentioned this pull request Aug 4, 2021
@kennymckormick
Member

What's your use case? Does this only work for stack=False DataContainers?

@hhaAndroid
Contributor Author

What's your use case? Does this only work for stack=False DataContainers?

The behavior is exactly the same as before; the custom scatter is simply replaced with PyTorch's own scatter.

@hhaAndroid hhaAndroid changed the title Accelerate training by replacing DataContainer object scatter [WIP] Accelerate training by replacing DataContainer object scatter Aug 5, 2021
@ZCMax
Contributor

ZCMax commented Aug 10, 2021

I tried this PR in MMDetection3D with PointPillars:

Environment:
configs: hv_pointpillars_secfpn_6x8_160e_kitti-3d-car.py
mmcv_version: 1.3.8
GPUs: 4 V100

original implementations:

2021-08-09 23:18:34,975 - mmdet - INFO - Epoch [21][50/310] lr: 7.286e-03, eta: 3:40:21, time: 1.751, data_time: 0.679, memory: 5405, loss_cls: 0.1120, loss_bbox: 0.2178, loss_dir: 0.0244, loss: 0.3542, grad_norm: 0.7251
2021-08-09 23:19:50,129 - mmdet - INFO - Epoch [21][100/310] lr: 7.352e-03, eta: 3:41:41, time: 1.502, data_time: 0.229, memory: 5405, loss_cls: 0.1055, loss_bbox: 0.2167, loss_dir: 0.0245, loss: 0.3466, grad_norm: 0.8071
2021-08-09 23:20:10,814 - mmdet - INFO - Epoch [21][150/310] lr: 7.416e-03, eta: 3:40:21, time: 0.415, data_time: 0.022, memory: 5405, loss_cls: 0.1051, loss_bbox: 0.2159, loss_dir: 0.0231, loss: 0.3442, grad_norm: 0.7746
2021-08-09 23:20:52,095 - mmdet - INFO - Epoch [21][200/310] lr: 7.480e-03, eta: 3:40:01, time: 0.825, data_time: 0.259, memory: 5405, loss_cls: 0.1044, loss_bbox: 0.2087, loss_dir: 0.0224, loss: 0.3355, grad_norm: 0.6906
2021-08-09 23:21:08,705 - mmdet - INFO - Epoch [21][250/310] lr: 7.544e-03, eta: 3:38:30, time: 0.333, data_time: 0.017, memory: 5405, loss_cls: 0.1081, loss_bbox: 0.2156, loss_dir: 0.0216, loss: 0.3453, grad_norm: 0.7769
2021-08-09 23:21:24,320 - mmdet - INFO - Epoch [21][300/310] lr: 7.607e-03, eta: 3:36:58, time: 0.312, data_time: 0.028, memory: 5405, loss_cls: 0.1027, loss_bbox: 0.2125, loss_dir: 0.0217, loss: 0.3369, grad_norm: 0.7366
2021-08-09 23:21:27,592 - mmdet - INFO - Saving checkpoint at 21 epochs
2021-08-09 23:22:37,692 - mmdet - INFO - Epoch [22][50/310] lr: 7.683e-03, eta: 3:37:29, time: 1.387, data_time: 0.943, memory: 5405, loss_cls: 0.1130, loss_bbox: 0.2207, loss_dir: 0.0233, loss: 0.3570, grad_norm: 0.7560
2021-08-09 23:23:12,504 - mmdet - INFO - Epoch [22][100/310] lr: 7.745e-03, eta: 3:36:51, time: 0.696, data_time: 0.022, memory: 5418, loss_cls: 0.1025, loss_bbox: 0.2089, loss_dir: 0.0235, loss: 0.3349, grad_norm: 0.7138
2021-08-09 23:23:28,139 - mmdet - INFO - Epoch [22][150/310] lr: 7.806e-03, eta: 3:35:20, time: 0.314, data_time: 0.023, memory: 5418, loss_cls: 0.1057, loss_bbox: 0.2157, loss_dir: 0.0238, loss: 0.3453, grad_norm: 0.6475
2021-08-09 23:23:59,514 - mmdet - INFO - Epoch [22][200/310] lr: 7.867e-03, eta: 3:34:33, time: 0.626, data_time: 0.220, memory: 5418, loss_cls: 0.1049, loss_bbox: 0.2132, loss_dir: 0.0217, loss: 0.3398, grad_norm: 0.7137
2021-08-09 23:24:13,722 - mmdet - INFO - Epoch [22][250/310] lr: 7.927e-03, eta: 3:33:01, time: 0.285, data_time: 0.020, memory: 5418, loss_cls: 0.1047, loss_bbox: 0.2112, loss_dir: 0.0213, loss: 0.3373, grad_norm: 0.6982
2021-08-09 23:24:36,918 - mmdet - INFO - Epoch [22][300/310] lr: 7.987e-03, eta: 3:31:53, time: 0.464, data_time: 0.023, memory: 5418, loss_cls: 0.0999, loss_bbox: 0.2078, loss_dir: 0.0231, loss: 0.3307, grad_norm: 0.7557
2021-08-09 23:24:40,149 - mmdet - INFO - Saving checkpoint at 22 epochs
2021-08-09 23:25:24,172 - mmdet - INFO -
Car AP@0.70, 0.70, 0.70:
bbox AP:89.4112, 83.4955, 79.2196
bev AP:89.8205, 79.7134, 79.2923
3d AP:70.0940, 61.0183, 56.1177
aos AP:89.25, 82.82, 78.41
Car AP@0.70, 0.50, 0.50:
bbox AP:89.4112, 83.4955, 79.2196
bev AP:90.5806, 88.2620, 87.3935
3d AP:90.4840, 87.6858, 85.4203
aos AP:89.25, 82.82, 78.41

using this PR:

2021-08-10 01:44:01,949 - mmdet - INFO - Exp name: hv_pointpillars_secfpn_6x8_160e_kitti-3d-car.py
2021-08-10 01:44:01,950 - mmdet - INFO - Epoch(val) [20][943] KITTI/Car_3D_easy_strict: 71.4129, KITTI/Car_BEV_easy_strict: 89.4457, KITTI/Car_2D_easy_strict: 89.4210, KITTI/Car_3D_moderate_strict: 63.6602, KITTI/Car_BEV_moderate_strict: 83.4318, KITTI/Car_2D_moderate_strict: 85.5265, KITTI/Car_3D_hard_strict: 61.5196, KITTI/Car_BEV_hard_strict: 79.2837, KITTI/Car_2D_hard_strict: 79.6540, KITTI/Car_3D_easy_loose: 94.7463, KITTI/Car_BEV_easy_loose: 94.8973, KITTI/Car_2D_easy_loose: 89.4210, KITTI/Car_3D_moderate_loose: 88.8798, KITTI/Car_BEV_moderate_loose: 89.1970, KITTI/Car_2D_moderate_loose: 85.5265, KITTI/Car_3D_hard_loose: 87.9514, KITTI/Car_BEV_hard_loose: 88.6405, KITTI/Car_2D_hard_loose: 79.6540
2021-08-10 01:44:48,076 - mmdet - INFO - Epoch [21][50/310] lr: 7.286e-03, eta: 3:21:12, time: 0.902, data_time: 0.469, memory: 5405, loss_cls: 0.1121, loss_bbox: 0.2190, loss_dir: 0.0238, loss: 0.3548, grad_norm: 0.6942
2021-08-10 01:45:07,487 - mmdet - INFO - Epoch [21][100/310] lr: 7.352e-03, eta: 3:20:01, time: 0.388, data_time: 0.017, memory: 5405, loss_cls: 0.1060, loss_bbox: 0.2160, loss_dir: 0.0239, loss: 0.3460, grad_norm: 0.7869
2021-08-10 01:45:27,478 - mmdet - INFO - Epoch [21][150/310] lr: 7.416e-03, eta: 3:18:52, time: 0.400, data_time: 0.020, memory: 5405, loss_cls: 0.1057, loss_bbox: 0.2148, loss_dir: 0.0242, loss: 0.3447, grad_norm: 0.7293
2021-08-10 01:45:41,673 - mmdet - INFO - Epoch [21][200/310] lr: 7.480e-03, eta: 3:17:28, time: 0.284, data_time: 0.023, memory: 5405, loss_cls: 0.1050, loss_bbox: 0.2093, loss_dir: 0.0228, loss: 0.3370, grad_norm: 0.6972
2021-08-10 01:45:57,756 - mmdet - INFO - Epoch [21][250/310] lr: 7.544e-03, eta: 3:16:09, time: 0.321, data_time: 0.033, memory: 5405, loss_cls: 0.1088, loss_bbox: 0.2153, loss_dir: 0.0218, loss: 0.3458, grad_norm: 0.7547
2021-08-10 01:46:16,487 - mmdet - INFO - Epoch [21][300/310] lr: 7.607e-03, eta: 3:15:00, time: 0.375, data_time: 0.022, memory: 5405, loss_cls: 0.1016, loss_bbox: 0.2119, loss_dir: 0.0226, loss: 0.3361, grad_norm: 0.7507
2021-08-10 01:47:32,780 - mmdet - INFO - Epoch [22][50/310] lr: 7.683e-03, eta: 3:15:58, time: 1.464, data_time: 0.456, memory: 5405, loss_cls: 0.1135, loss_bbox: 0.2205, loss_dir: 0.0239, loss: 0.3578, grad_norm: 0.7754
2021-08-10 01:48:45,143 - mmdet - INFO - Epoch [22][100/310] lr: 7.745e-03, eta: 3:17:17, time: 1.447, data_time: 0.017, memory: 5418, loss_cls: 0.1020, loss_bbox: 0.2094, loss_dir: 0.0227, loss: 0.3341, grad_norm: 0.6982
2021-08-10 01:49:02,153 - mmdet - INFO - Epoch [22][150/310] lr: 7.806e-03, eta: 3:16:02, time: 0.339, data_time: 0.022, memory: 5418, loss_cls: 0.1056, loss_bbox: 0.2156, loss_dir: 0.0243, loss: 0.3455, grad_norm: 0.5944
2021-08-10 01:49:43,651 - mmdet - INFO - Epoch [22][200/310] lr: 7.867e-03, eta: 3:15:54, time: 0.830, data_time: 0.452, memory: 5418, loss_cls: 0.1060, loss_bbox: 0.2176, loss_dir: 0.0224, loss: 0.3459, grad_norm: 0.7582
2021-08-10 01:50:06,232 - mmdet - INFO - Epoch [22][250/310] lr: 7.927e-03, eta: 3:14:55, time: 0.451, data_time: 0.019, memory: 5418, loss_cls: 0.1046, loss_bbox: 0.2113, loss_dir: 0.0222, loss: 0.3382, grad_norm: 0.7089
2021-08-10 01:50:21,506 - mmdet - INFO - Epoch [22][300/310] lr: 7.987e-03, eta: 3:13:37, time: 0.307, data_time: 0.023, memory: 5418, loss_cls: 0.0999, loss_bbox: 0.2043, loss_dir: 0.0223, loss: 0.3264, grad_norm: 0.6924
2021-08-10 01:51:12,615 - mmdet - INFO -
Car AP@0.70, 0.70, 0.70:
bbox AP:89.6106, 83.8629, 79.3142
bev AP:89.9098, 80.0422, 79.3679
3d AP:72.0378, 62.7379, 56.7960
aos AP:89.37, 83.15, 78.34
Car AP@0.70, 0.50, 0.50:
bbox AP:89.6106, 83.8629, 79.3142
bev AP:90.5960, 88.8071, 87.5059
3d AP:90.5378, 88.3073, 85.5774
aos AP:89.37, 83.15, 78.34

With this PR, training appears faster and accuracy slightly higher.

@zhouzaida zhouzaida mentioned this pull request Aug 13, 2021
@hhaAndroid
Contributor Author

We found that this modification is not needed for now, so the PR is closed. If there are new developments later, it will be reopened.

@hhaAndroid hhaAndroid closed this Aug 16, 2021
4 participants