[Sharding]: update config DOC #32299
Merged: ForFishes merged 3 commits into PaddlePaddle:develop from JZ-LIANG:static/hybrid-parallelism/4d-doc on Apr 20, 2021.
@@ -744,6 +744,8 @@ def sharding(self):
    idea from [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/abs/1910.02054).
    Model parameters and optimizer states are sharded across different ranks, allowing larger models to fit in memory.

    In the hybrid parallelism scenario, we use the sharding config as a uniform API to set up each form of parallelism.

    Default value: False

    Examples:
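The example body of this docstring is elided by the hunk above; as a hedged, minimal sketch of the usage being documented (the config values below are illustrative assumptions, not taken from this PR):

    # Minimal sketch: turn sharding on through fleet's DistributedStrategy.
    # The option values are illustrative assumptions, not from this diff.
    import paddle.distributed.fleet as fleet

    strategy = fleet.DistributedStrategy()
    strategy.sharding = True
    strategy.sharding_configs = {
        "sharding_segment_strategy": "segment_broadcast_MB",
        "segment_broadcast_MB": 32,
        "sharding_degree": 8,
    }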
@@ -770,29 +772,51 @@ def sharding_configs(self):
    Set sharding configurations.

    **Note**:
        fuse_broadcast_MB(float): size of a fused group of broadcast parameters.
            This configuration affects the communication speed in sharding training,
            and should be an empirical value decided by your model size and network topology.

        sharding_segment_strategy(string, optional): strategy used to segment the program (forward & backward operations). Two strategies are
            available: "segment_broadcast_MB" and "segment_anchors". A segment is a concept used in sharding to overlap computation and
            communication. Default is segment_broadcast_MB.

        segment_broadcast_MB(float, optional): segment by the broadcast volume of parameters. Sharding introduces parameter broadcast operations into the program, and
            after every segment_broadcast_MB of parameters has been broadcast, the program is cut into one segment.
            This configuration affects the communication speed in sharding training, and should be an empirical value decided by your model size and network topology.
            Only takes effect when sharding_segment_strategy = segment_broadcast_MB. Default is 32.0.

        segment_anchors(list): list of anchors used to segment the program, which allows finer control of program segmentation.
            This strategy is experimental for now. Only takes effect when sharding_segment_strategy = segment_anchors.

        sharding_degree(int, optional): specifies the number of GPUs within each sharding parallelism group; sharding is turned off if sharding_degree=1. Default is 8.

        gradient_merge_acc_step(int, optional): specifies the accumulation steps in gradient merge; gradient merge is turned off if gradient_merge_acc_step=1. Default is 1.

        hybrid_dp(bool): enable hybrid data parallelism on top of the sharding parallelism.
            You need at least twice the number of GPUs used in normal sharding
            training to enable this feature.

        optimize_offload(bool, optional): enable optimizer offload, which offloads the moment variables to host memory in order to save GPU memory and fit larger models.
            The moment variables are prefetched from and offloaded to host memory during the update stage. It is a strategy that trades training speed for GPU memory, and is recommended only when gradient_merge_acc_step is large, so that
            the number of update stages is relatively small compared with the forward & backward passes. Default is False.

        dp_degree(int, optional): specifies the number of data parallelism groups; when dp_degree >= 2, dp_degree-way data parallelism is introduced as the outer parallelism around the inner parallelism. The user is responsible for ensuring global_world_size = mp_degree * sharding_degree * pp_degree * dp_degree. Default is 1.

        mp_degree(int, optional): [Hybrid parallelism ONLY] specifies the number of GPUs within each megatron parallelism group; megatron parallelism is turned off if mp_degree=1. Default is 1.

        pp_degree(int, optional): [Hybrid parallelism ONLY] specifies the number of GPUs within each pipeline parallelism group; pipeline parallelism is turned off if pp_degree=1. Default is 1.

        pp_allreduce_in_optimize(bool, optional): [Hybrid parallelism ONLY] move the allreduce operations from the backward stage to the update (optimize) stage when pipeline parallelism is on.
            This configuration affects the communication speed of hybrid parallelism training depending on the network topology. This strategy is experimental for now. Default is False.

        sharding_group_size(int): attribute of hybrid_dp. Specifies the number of GPUs within
            each sharding group; therefore, the number of hybrid data parallelism ways equals
            (global_size / sharding_group_size).
    Examples:

        .. code-block:: python

            # sharding-DP, 2 nodes with 8 gpus per node
            import paddle.distributed.fleet as fleet
            strategy = fleet.DistributedStrategy()
            strategy.sharding = True
            strategy.sharding_configs = {
                "fuse_broadcast_MB": 32,
                "hybrid_dp": True,
                "sharding_group_size": 8}
                "sharding_segment_strategy": "segment_broadcast_MB",
                "segment_broadcast_MB": 32,
                "sharding_degree": 8,
                "dp_degree": 2,
                "gradient_merge_acc_step": 4,
            }
    """
    return get_msg_dict(self.strategy.sharding_configs)
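Putting the documented options together, here is a hedged sketch of a full hybrid-parallelism (4D) configuration. The degree values are illustrative assumptions, chosen so that mp_degree * sharding_degree * pp_degree * dp_degree equals a 16-GPU global world size, as the dp_degree note requires:

    # Illustrative 4D hybrid-parallelism sketch built only from options
    # documented above. Assumed 16-GPU job:
    # 2 (mp) * 2 (pp) * 2 (sharding) * 2 (dp) = 16 = global world size.
    import paddle.distributed.fleet as fleet

    strategy = fleet.DistributedStrategy()
    strategy.sharding = True
    strategy.sharding_configs = {
        "sharding_segment_strategy": "segment_broadcast_MB",
        "segment_broadcast_MB": 32,
        "mp_degree": 2,
        "pp_degree": 2,
        "sharding_degree": 2,
        "dp_degree": 2,
        "gradient_merge_acc_step": 4,
    }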
@@ -845,7 +869,7 @@ def pipeline_configs(self):
    **Notes**:
        **Detailed arguments for pipeline_configs**

        **micro_batch**: the number of small batches in each user defined batch
        **micro_batch_size**: the number of small batches in each user defined batch
Review comment: The corresponding Chinese documentation was not updated for this part.
Reply: done~
    Examples:

@@ -854,7 +878,7 @@ def pipeline_configs(self):
            import paddle.distributed.fleet as fleet
            strategy = fleet.DistributedStrategy()
            strategy.pipeline = True
            strategy.pipeline_configs = {"micro_batch": 12}
            strategy.pipeline_configs = {"micro_batch_size": 12}

    """
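For scripts being updated for this rename, a hedged sketch of pipeline parallelism used together with the sharding options documented above (all values are illustrative assumptions, not taken from the diff):

    # Illustrative: pipeline parallelism alongside sharding.
    # micro_batch_size (renamed from micro_batch) splits each user-defined
    # batch into 4 micro-batches; pp_degree is the documented sharding_configs
    # knob for the pipeline-parallel group size. Values are assumptions.
    import paddle.distributed.fleet as fleet

    strategy = fleet.DistributedStrategy()
    strategy.pipeline = True
    strategy.pipeline_configs = {"micro_batch_size": 4}
    strategy.sharding = True
    strategy.sharding_configs = {
        "sharding_degree": 2,
        "pp_degree": 2,
    }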
Review comment: The concepts of segment_broadcast_MB and segment_anchors need to be introduced, don't they?
Reply: updated