
[Auto para] Relaunch with auto mapping function #37326

Merged 55 commits into PaddlePaddle:develop from auto_para_launch on Dec 7, 2021

Conversation

aoyulong (Contributor)

PR types: New features

PR changes: Others

Describe

This PR relaunches distributed training based on the rank mapping file produced by the auto mapping function, so launching happens twice (see the sketch after this list):

  1. The first launch produces the rank mapping file from the distributed programs and the cluster topology.
  2. The second launch uses the rank mapping file to map each rank to its corresponding device for the actual distributed training.
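A minimal sketch of this two-phase flow, assuming hypothetical helper names (produce_rank_mapping, launch_with_mapping) and a hypothetical JSON layout rather than Paddle's actual API:

import json

def produce_rank_mapping(cluster_topo_path, rank_mapping_path):
    # First launch: derive a rank -> device mapping from the cluster
    # topology (the real pass also analyzes the distributed programs).
    with open(cluster_topo_path) as f:
        topo = json.load(f)
    mapping = {rank: dev for rank, dev in enumerate(topo.get("devices", []))}
    with open(rank_mapping_path, "w") as f:
        json.dump(mapping, f)

def launch_with_mapping(rank_mapping_path):
    # Second launch: read the mapping back and bind each rank to its device.
    with open(rank_mapping_path) as f:
        mapping = json.load(f)
    for rank, device in mapping.items():
        print(f"rank {rank} -> {device}")  # stand-in for spawning a worker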

@paddle-bot-old

Thanks for your contribution!
Please wait for the CI result first. See the Paddle CI Manual for details.

Comment on lines +306 to +307
# Preserve the original command-line arguments so a later relaunch can
# reconstruct them from the environment.
original_args = sys.argv[1:]
os.environ["PADDLE_ORIGINAL_CMD_ARGS"] = " ".join(original_args)
Member

this part looks fragile

Contributor Author

This is handled by shlex.split on line 154 of parallelizer.py, mentioned above.
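To illustrate the concern and the fix (this sketch is not the code in parallelizer.py; shlex.join requires Python 3.8+):

import shlex

args = ["--log_dir", "my logs", "--nproc", "2"]

# Fragile: a plain space join loses the token boundary inside "my logs".
assert " ".join(args).split() != args

# Round-trips safely: shlex.join quotes on the way in, and shlex.split
# recovers the original tokens on the way out.
assert shlex.split(shlex.join(args)) == args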

Comment on lines +169 to +179
collective_group.add_argument(
    "--cluster_topo_path",
    type=str,
    default=None,
    help="A JSON format file will be stored in this path, which is used "
    "to represent the cluster topology information for auto parallel.")
collective_group.add_argument(
    "--rank_mapping_path",
    type=str,
    default=None,
    help="A JSON format file will be stored in this path, which is used "
    "to map processes to machines for auto parallel.")
Member

Adding one more config file is expensive. Is it possible to use a single xxx_config to hold everything? Maybe a paddle_config in which you own some sections?

Contributor Author

The rank_mapping file will be automatically generated by our framework in the pre-launch analysis pass and must not be exposed to users.
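Purely as an illustration of what such an auto-generated file could contain (this shape is an assumption, not Paddle's actual schema):

import json

# Hypothetical rank mapping: each rank is pinned to a machine and device.
# The real file is produced by the pre-launch analysis pass and its schema
# is internal to the framework.
rank_mapping = {
    "0": {"machine": "node-0", "device": "GPU_0"},
    "1": {"machine": "node-0", "device": "GPU_1"},
    "2": {"machine": "node-1", "device": "GPU_0"},
    "3": {"machine": "node-1", "device": "GPU_1"},
}

with open("rank_mapping.json", "w") as f:
    json.dump(rank_mapping, f, indent=2)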

@kuizhiqing (Member) left a comment

LGTM

@XieYunshen (Contributor) left a comment

LGTM for set_tests_properties(test_auto_parallel_relaunch PROPERTIES LABELS "RUN_TYPE=EXCLUSIVE" TIMEOUT 120)

@JZ-LIANG (Contributor) left a comment

LGTM

@JZ-LIANG JZ-LIANG merged commit 506e79d into PaddlePaddle:develop Dec 7, 2021
@aoyulong aoyulong deleted the auto_para_launch branch December 7, 2021 04:14
Zjq9409 pushed a commit to Zjq9409/Paddle that referenced this pull request Dec 10, 2021
* [Auto Parallel]  Add the unified cluster representation

* [Auto Parallel] Add the graph class for physical mapping

* [Auto Parallel] Add the simple physical mapper

* Set the timeout of the mapper

* Merge the upstream develop unittests cmake files

* Fix a bug of the process group

* Remove mapper unittest from platforms which is not GPU

* Move the instantiation of process group after resharding

* Add the local id for devices

* Update the rank mapping format

* [Auto Parallel] Relaunch with the rank mapping file

* Remove the unnecessary json file

* Avoid entering get_device_proc_info for auto mapping

* Correct the mapper unit test

* Add some comments

* Remove the related files about mapping

* Update the unittest for auto mapping

* Remove unused rank_mapping unittest

* Improve the unittest coverage

* Improve the unittest coverage

* Improve the unittest of relaunch

* Fix the unittest problem in CI

* Improve the unittest of relaunch

* Remove unnecessary statements

* Update the unittest cmakefile

* Correct the cmakefile of auto parallel unittests

* Modify codes based on the new elastic change

* Use the GPUs exclusively in the unittest

* Correct the cmakefile

* Set the timeout of the unittest