
[Auto para] Relaunch with auto mapping function #37326

Merged 55 commits into PaddlePaddle:develop from auto_para_launch on Dec 7, 2021

Conversation

aoyulong (Contributor)

PR types: New features

PR changes: Others

Describe

This PR relaunches distributed training based on the rank mapping file produced by the auto mapping function, so launching happens twice (see the sketch after this list):

  1. The first launch produces the rank mapping file from the distributed programs and the cluster topology.
  2. The second launch uses the rank mapping file to map each rank to its corresponding device for the actual distributed training.
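A minimal sketch of this two-phase flow, assuming hypothetical helper names (produce_rank_mapping, launch_with_mapping) and a hypothetical JSON layout rather than Paddle's actual API:

import json

def produce_rank_mapping(cluster_topo_path, rank_mapping_path):
    # First launch: derive a rank -> device mapping from the cluster
    # topology (the real pass also analyzes the distributed programs).
    with open(cluster_topo_path) as f:
        topo = json.load(f)
    mapping = {rank: dev for rank, dev in enumerate(topo.get("devices", []))}
    with open(rank_mapping_path, "w") as f:
        json.dump(mapping, f)

def launch_with_mapping(rank_mapping_path):
    # Second launch: read the mapping back and bind each rank to its device.
    with open(rank_mapping_path) as f:
        mapping = json.load(f)
    for rank, device in mapping.items():
        print(f"rank {rank} -> {device}")  # stand-in for spawning a worker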

@paddle-bot-old

Thanks for your contribution!
Please wait for the CI result first. See the Paddle CI Manual for details.

Comment on lines +306 to +307
# Preserve the original command-line arguments so a later relaunch can
# reconstruct them from the environment.
original_args = sys.argv[1:]
os.environ["PADDLE_ORIGINAL_CMD_ARGS"] = " ".join(original_args)
Member

this part looks fragile

Contributor Author

This is handled by shlex.split on line 154 of parallelizer.py, mentioned above.
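To illustrate the concern and the fix (this sketch is not the code in parallelizer.py; shlex.join requires Python 3.8+):

import shlex

args = ["--log_dir", "my logs", "--nproc", "2"]

# Fragile: a plain space join loses the token boundary inside "my logs".
assert " ".join(args).split() != args

# Round-trips safely: shlex.join quotes on the way in, and shlex.split
# recovers the original tokens on the way out.
assert shlex.split(shlex.join(args)) == args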

Comment on lines +169 to +179
collective_group.add_argument(
    "--cluster_topo_path",
    type=str,
    default=None,
    help="A JSON format file will be stored in this path, which is used "
    "to represent the cluster topology information for auto parallel.")
collective_group.add_argument(
    "--rank_mapping_path",
    type=str,
    default=None,
    help="A JSON format file will be stored in this path, which is used "
    "to map processes to machines for auto parallel.")
Member

Adding one more config file is expensive. Is it possible to use a single xxx_config to hold everything? Maybe a paddle_config in which you own some sections?

Contributor Author

The rank_mapping file will be automatically generated by our framework in the pre-launch analysis pass and must not be exposed to users.
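Purely as an illustration of what such an auto-generated file could contain (this shape is an assumption, not Paddle's actual schema):

import json

# Hypothetical rank mapping: each rank is pinned to a machine and device.
# The real file is produced by the pre-launch analysis pass and its schema
# is internal to the framework.
rank_mapping = {
    "0": {"machine": "node-0", "device": "GPU_0"},
    "1": {"machine": "node-0", "device": "GPU_1"},
    "2": {"machine": "node-1", "device": "GPU_0"},
    "3": {"machine": "node-1", "device": "GPU_1"},
}

with open("rank_mapping.json", "w") as f:
    json.dump(rank_mapping, f, indent=2)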

@kuizhiqing (Member) left a comment

LGTM

@XieYunshen (Contributor) left a comment

LGTM for set_tests_properties(test_auto_parallel_relaunch PROPERTIES LABELS "RUN_TYPE=EXCLUSIVE" TIMEOUT 120)

@JZ-LIANG (Contributor) left a comment

LGTM

@JZ-LIANG JZ-LIANG merged commit 506e79d into PaddlePaddle:develop Dec 7, 2021
@aoyulong aoyulong deleted the auto_para_launch branch December 7, 2021 04:14
Zjq9409 pushed a commit to Zjq9409/Paddle that referenced this pull request Dec 10, 2021
* [Auto Parallel]  Add the unified cluster representation

* [Auto Parallel] Add the graph class for physical mapping

* [Auto Parallel] Add the simple physical mapper

* Set the timeout of the mapper

* Merge the upstream develop unittests cmake files

* Fix a bug of the process group

* Remove mapper unittest from platforms which is not GPU

* Move the instantiation of process group after resharding

* Add the local id for devices

* Update the rank mapping format

* [Auto Parallel] Relaunch with the rank mapping file

* Remove the unnecessary json file

* Avoid entering get_device_proc_info for auto mapping

* Correct the mapper unit test

* Add some comments

* Remove the related files about mapping

* Update the unittest for auto mapping

* Remove unused rank_mapping unittest

* Improve the unittest coverage

* Improve the unittest coverage

* Improve the unittest of relaunch

* Fix the unittest problem in CI

* Improve the unittest of relaunch

* Remove unnecessary statements

* Update the unittest cmakefile

* Correct the cmakefile of auto parallel unittests

* Modify codes based on the new elastic change

* Use the GPUs exclusively in the unittest

* Correct the cmakefile

* Set the timeout of the unittest